Proper-Casing Excel Data With Python

Katie - Jan 10 '19 - - Dev Community

Helen Anderson’s article “Excel Series: Formulas and Functions” went around Twitter today, and her first steps were so beautifully simple I thought, “Hey, let’s see what they look like in Python for a little compare-and-contrast.”

I need to get back to work, so I only got as far as her very first example, =PROPER(A1), but that alone was quite interesting.

The Sample Data

First, I set up a very small table to work with and saved it inside a variable I decided to call df:

fname company
anush apple
benita boiron
cathy cvs
darweesh dell

In my case, the Python code to do this was as follows, because I included my sample data as part of the code:

from collections import OrderedDict
import pandas

initdata = OrderedDict([('fname', ['anush','benita','cathy','darweesh']),
          ('company', ['apple','boiron','cvs','dell']) ])
df = pandas.DataFrame(initdata)
Enter fullscreen mode Exit fullscreen mode

But usually it’d look more like this, if my data were inside a spreadsheet called sample.xlsx:

import pandas
df = pandas.read_excel('sample.xlsx')
Enter fullscreen mode Exit fullscreen mode

Displayed in my console with print(df) (after having first included a line of code pandas.set_option('expand_frame_repr', False) to prettify the output), it looks like this:

      fname company
0     anush   apple
1    benita  boiron
2     cathy     cvs
3  darweesh    dell
Enter fullscreen mode Exit fullscreen mode

Proper-Casing Everybody Is “Quicker” In Python Than In Excel

Python’s Pandas module (“module” = extended set of commands) is geared at using minimal code to repeat the same procedure against an entire set of data, such as a column, at once.

Consequently, whereas Helen’s =PROPER(A1) Excel formula will just populate whatever single cell you put it into with a “title-cased” copy of the contents of cell “A1” alone, Pandas commands are simplest when you add a whole new column to your spreadsheet with every first name title-cased, all the way down.

Note that in Pandas, you typically refer to columns by their names, not by letters of the alphabet indicating the order they come in.

This Python code:

df['fname_proper'] = df['fname'].str.title()
print(df)
Enter fullscreen mode Exit fullscreen mode

Modifies the contents of the table stored in the variable df and displays this output:

      fname company fname_proper
0     anush   apple        Anush
1    benita  boiron       Benita
2     cathy     cvs        Cathy
3  darweesh    dell     Darweesh
Enter fullscreen mode Exit fullscreen mode

Which is this data:

fname company fname_proper
anush apple Anush
benita boiron Benita
cathy cvs Cathy
darweesh dell Darweesh

A word on “speed”

Slicing bread with a chainsaw

I put “quicker” in quotes in this section’s heading because it’s debateable whether writing an entire Python program to add 1 column to a spreadsheet is really “quicker” than typing a formula into Excel in extra blank space and copying the formula down to the end of the page.

No, you know what? It’s not actually debateable.

In the real world, it’d be slicing bread with a chainsaw.

What I mean by “quicker” is that Python’s Pandas module’s commands are designed to facilitate thinking in large datasets, whereas when you think about how to use Excel commands, you work more “one cell at a time.”

Note: Where things do start to get legitimately faster in Python is with repetition or complexity. Examples:

  • You have a 300-column spreadsheet, 100 of which have the word “Name” in the column header. You need to create a copy of every single column that has the word “Name” in its header and title-case the contents of the copy. You’re going to save a lot of time by coding that rather than doing it by hand. And the Python I just showed you would be part of that code.
  • You were already processing your spreadsheet with Python because you were doing some other sort of complicated operation (like a multi-column VLOOKUP between spreadsheets). Why not throw in one more operation to your Python program, rather than leave yourself work to do manually in Excel after Python finishes the “difficult” work? I often find this to be the case when, say, combining “people” data with “campaign” data to generate a “CampaignMember” file to insert into Salesforce.

Only adding “Anush”

But perhaps I’m being a bit presumptuous in assuming Helen ever intended to “copy down” =PROPER(A1) to title-case more rows.

Maybe she really did just want to make a title-cased copy of a single cell.

Let’s look at two very similar approaches to this in Python.

The differences between them are negligible for a beginner – there are simply slightly different ways you have to type the “right side” of the = command depending on how you type the “left side,” due to a notion called “data types."

What the two approaches have in common is that they’re both complicated.

(Python’s “Pandas” commands are not optimized for easy typing when it comes to doing single-cell operations like this. That’s considered the exception, not the norm, for what most people want to do.)

Also, they both refer to what was, in Helen’s example, “row 1,” as row “0.”

Approach 1

This Python code:

df.loc[0,'fname_1st_row_only_proper_v1'] = df.loc[0,'fname'].title()
print(df)
Enter fullscreen mode Exit fullscreen mode

Modifies the contents of the table stored in the variable df (which, remember, now also has a fname_proper column) and displays this output:

      fname company fname_proper fname_1st_row_only_proper_v1
0     anush   apple        Anush                        Anush
1    benita  boiron       Benita                          NaN
2     cathy     cvs        Cathy                          NaN
3  darweesh    dell     Darweesh                          NaN
Enter fullscreen mode Exit fullscreen mode

Which is this data:

fname company fname_proper fname_1st_row_only_proper_v1
anush apple Anush Anush
benita boiron Benita
cathy cvs Cathy
darweesh dell Darweesh

Approach 2

This Python code:

df.loc[[0],'fname_1st_row_only_proper_v2'] = df['fname'].str.title()
print(df)
Enter fullscreen mode Exit fullscreen mode

Modifies the contents of the table stored in the variable df (which, remember, already has 4 columns before we run the command) and displays this output:

      fname company fname_proper fname_1st_row_only_proper_v1 fname_1st_row_only_proper_v2
0     anush   apple        Anush                        Anush                        Anush
1    benita  boiron       Benita                          NaN                          NaN
2     cathy     cvs        Cathy                          NaN                          NaN
3  darweesh    dell     Darweesh                          NaN                          NaN
Enter fullscreen mode Exit fullscreen mode

Which is this data:

fname company fname_proper fname_1st_row_only_proper_v1 fname_1st_row_only_proper_v2
anush apple Anush Anush Anush
benita boiron Benita
cathy cvs Cathy
darweesh dell Darweesh

Takeaway

Lessons I hope you’ll take away from this blog post are:

  1. Certain spreadsheet-modification tasks really aren’t much more complicated to code in Python than they are in Excel, once you’re familiar with them. And now you know one! (Adding a new title-cased copy of a column.)
  2. Pick the right tool for the job. Not only is there overhead to coding with Python vs. clicking cells in Excel and editing their formulas, but some commands are simpler than others in Python, just like in Excel.

Full code

Here’s the full script, which you can copy and paste into a Repl.it editor to run for yourself:

from collections import OrderedDict
import pandas
pandas.set_option('expand_frame_repr', False)

def p(x):
    print(x)
    print('---DIVIDER---')

initdata = OrderedDict([('fname', ['anush','benita','cathy','darweesh']),
          ('company', ['apple','boiron','cvs','dell']) ])
df = pandas.DataFrame(initdata)
p(df)

# "=PROPER(A1)" imitation #1 of 3 (entire "fname" column)
df['fname_proper'] = df['fname'].str.title()
p(df)

# "=PROPER(A1)" imitation #2 of 3 (single cell approach #1 of 2)
df.loc[0,'fname_1st_row_only_proper_v1'] = df.loc[0,'fname'].title()
p(df)

# "=PROPER(A1)" imitation #3 of 3 (single cell approach #2 of 2)
df.loc[[0],'fname_1st_row_only_proper_v2'] = df['fname'].str.title()
p(df)
Enter fullscreen mode Exit fullscreen mode

And its full output:

      fname company
0     anush   apple
1    benita  boiron
2     cathy     cvs
3  darweesh    dell
---DIVIDER---
      fname company fname_proper
0     anush   apple        Anush
1    benita  boiron       Benita
2     cathy     cvs        Cathy
3  darweesh    dell     Darweesh
---DIVIDER---
      fname company fname_proper fname_1st_row_only_proper_v1
0     anush   apple        Anush                        Anush
1    benita  boiron       Benita                          NaN
2     cathy     cvs        Cathy                          NaN
3  darweesh    dell     Darweesh                          NaN
---DIVIDER---
      fname company fname_proper fname_1st_row_only_proper_v1 fname_1st_row_only_proper_v2
0     anush   apple        Anush                        Anush                        Anush
1    benita  boiron       Benita                          NaN                          NaN
2     cathy     cvs        Cathy                          NaN                          NaN
3  darweesh    dell     Darweesh                          NaN                          NaN
---DIVIDER---
Enter fullscreen mode Exit fullscreen mode
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .