Some Pandas functions I have learned during leetcode

Pandas Basics Summary I


Continue with my previous post, below are the Leetcode questions about Pandas

The list of questions about Pandas

  1. Create a DataFrame from List
  2. Get the Size of a DataFrame
  3. Display the First Three Rows
  4. Select Data
  5. Create a New Column
  6. Drop Duplicate Rows
  7. Drop Missing Data
  8. Modify Columns
  9. Rename Columns
  10. Change Data Type
  11. Fill Missing Data
  12. Reshape Data: Concatenate
  13. Reshape Data: Pivot
  14. Reshape Data: Melt
  15. Method Chaining

Setup Pandas

Pandas is all about manipulating DataFrame. Imagine DataFrame is a table.

To use Pandas, import it from pd:

import pandas as pd

Pandas Functions

# Create DataFrame from a list
list = [[1, 'Apple'], [2, 'Banana'], [3, 'Orange']]
df = pd.DataFrame(list, columns=['id', 'fruit'])


# Create DataFrame from an object
# Each property is a list
obj = {
    'name': ['Piper', 'Grace', 'Georgia', 'Willow', 'Finn', 'Thomas'],
    'salary': [4548, 28150, 1103, 6593, 74576, 24433]
}
pd.DataFrame(obj)


# Get number of rows
len(data)


# Get columns from a dataframe 
df.columns


# Get row with indexer, iloc = integer location 
df.iloc[n][column_name]
# e.g. df.iloc[0]['id']
df.iloc[0, 0]  # Selects the first row and first column value
df.iloc[0:2]   # Selects rows from position 0 to 1 (end position 2 is excluded)


# Get row by labels
df.loc['index_label']
df.loc['index_label', 'column_name']
df.loc['index_label_start':'index_label_end']


# Get shape (rows x columns) from a dataframe
(rows, cols) = df.shape


# Get top N rows
df.head(3)


# Filter rows
df[df['column_name'] == val]
# e.g row = students[students['student_id'] == 101]


# Select fields from rows
df[['column_name1', 'column_name2']]


# Create new column from existing column
df['new_column'] = df['old_column'] * 2


# Modify all values in a column
df['column_name'] = df['column_name'] * 2


# Iterate rows
for row_id, row in df.iterrows():


# Drop duplicates
df.drop_duplicates(subset='column_name', keep='first')
# drops rows that drop_duplicates() returns TRUE, see the test case below for details
# keep=False, Mark all duplicates TRUE
# duplicates_false = df.duplicated(subset='email', keep=False)
# 0    False
# 1     True <- Both marked true as keep set to false, don't keep any duplicates
# 2     True <-

# duplicates_first = df.duplicated(subset='email', keep='first')
# keep='first', Mark all duplicates TRUE except the first occurance
# 0    False
# 1    False <- Marked false to keep because it is the first
# 2     True

# duplicates_last = df.duplicated(subset='email', keep='last')
# keep='last', Mark all duplicates TRUE except the last occurance
# 0    False
# 1     True
# 2    False <- Marked false to keep because it is the last


# Filter rows by column not null
df[df['column_name'].notnull()]
# returns a series of id, boolean
# since TRUE = 1, FALSE = 0, counting not null rows can be:
df['column_name'].notnull().sum() == 1


# Filter rows by column is null
df[df['column_name'].isnull()]


# Sum all rows from a column
df['column_name'].sum()


# Rename columns
df.rename(columns={'old_column1': 'new_column1', 'old_column2': 'new_column2'})


# Get column name
df.columns.tolist() == ['column_name1', 'column_name2']


# Change column type
df['column_name'] = df['column_name'].astype(int)


# Check column type
ptypes.is_float_dtype(df['column_name'])


# Fill data for null
df['column_name'] = df['column_name'].fillna(value)


# Append / concat dataframes
pd.concat([df1, df2])


# Sort rows by a columns
df.sort_values(by='column_name', ascending=False) # or
df.sort_values(by=['column_name1','column_name2'], ascending=False) # or


# Get the indexes
df.index

# Setting index
df.set_index('name', inplace=True)
# if inplace = False it will create a new dataframe


# Reset index to default integer index (index can be set to a string field for example with set_index)
df.reset_index(drop=True)
# if drop = false, the original index column will preserve, column will be dropped entirely otherwise 


# Reshape from long to wide format (Pivot)
# values to columns
# e.g use the month as index column, then each unique value in city column become a new column, then aggregate values from temperature column. 
df.pivot(index='product', columns='quarter', values='sales')
# index: Column to use to make new frame’s index.
# columns: Column to use to make new frame’s columns.
# values: Column(s) to use for populating new frame’s values.

# before
# product     | quarter   | sales
# Umbrella    | quarter_1 | 417
# SleepingBag | quarter_1 | 800
# Umbrella    | quarter_2 | 224
# SleepingBag | quarter_2 | 936
# ...

# after
# product    | quarter_1 | quarter_2 
# Umbrella   | 417       | 224       
# SleepingBag| 800       | 936       


# Reshape from wide to long format (Melt)
# columns to values
pd.melt(report, id_vars=['product'], var_name='quarter', value_name='sales')

# Sample DataFrame
data = {
    'product': ['Umbrella', 'SleepingBag'],
    'quarter_1': [417, 800],
    'quarter_2': [224, 936],
}
# before
# product    | quarter_1 | quarter_2 
# Umbrella   | 417       | 224       
# SleepingBag| 800       | 936       

# after
# product     | quarter   | sales
# Umbrella    | quarter_1 | 417
# SleepingBag | quarter_1 | 800
# Umbrella    | quarter_2 | 224
# SleepingBag | quarter_2 | 936
# ...