Cleaning Data: wrangling data for a SageMaker pipeline

Julie Fisher - Nov 28 '22 - - Dev Community

Wrangling data into a usable form is a big part of any real world machine learning problem. When tackling these types of problems, several things are generally true:

First, the information to understand and/or solve the proposed business problem is usually stored in multiple locations or files. Whether it's multiple excel files, different tables within the same database, or completely different systems, having to pull information from multiple sources and then combine them is common.

Second, the data is usually in its raw form. This means the data will need to be featurized. Free text needs to be turned into some kind of numeric representation, categories either need to be binned or encoded, numeric features may need to be standardized, lists of values turned into single number representations, or other custom transformations may need to be applied.

I chose the Insurance Company Benchmark (COIL 2000) dataset because it let's me demonstrate some of these real world problems:

  1. It's stored in multiple files
  2. It has both numeric and categorical features
  3. It's a binary classification problem

Objectives

Like a production level problem, this dataset needs to be combined and put into a usable form. Here are the data wrangling steps I'll complete in this post:

  • Combine train, test, and label datasets into a single full dataset
    • Production datasets don't usually come pre-split
    • Figuring out when, where, and how to split the data is an important production consideration
  • Update column names to something human readable
    • This makes my life easier when completing my exploratory data analysis to understand what the data is
    • Blind data processing is generally a bad idea
  • Return categorical features to their textual representation
    • Allows me to demonstrate both numeric and categorical handling as I move through creating a SageMaker Pipeline

Prerequisites

All prerequisites from the previous post, 1_read_from_s3.ipynb, still apply (SageMaker environment and applicable IAM roles are set up). The data that was loaded into S3 in the previous post will also be needed. Below is the code to pull the data from the UCI page and load it to S3 incase you deleted the files from the last post.

# Only run this if you deleted the output from the previous post or didn't run it at all
# If you run it accidentally, oh well, it'll just overwrite the previous files
import pandas as pd
import sagemaker

session = sagemaker.session.Session()
bucket = session.default_bucket()
prefix = '1_ins_dataset/raw'

train_uri = 'https://archive.ics.uci.edu/ml/machine-learning-databases/tic-mld/ticdata2000.txt'
test_uri = 'https://archive.ics.uci.edu/ml/machine-learning-databases/tic-mld/ticeval2000.txt'
gt_uri = 'https://archive.ics.uci.edu/ml/machine-learning-databases/tic-mld/tictgts2000.txt'
cols_uri = 'https://archive.ics.uci.edu/ml/machine-learning-databases/tic-mld/dictionary.txt'

train = pd.read_table(train_uri, header=None)
test = pd.read_table(test_uri, header=None)
ground_truth = pd.read_table(gt_uri, header=None)
columns = pd.read_table(cols_uri, encoding='latin-1')

train.to_csv(f's3://{bucket}/{prefix}/train.csv', index=False)
test.to_csv(f's3://{bucket}/{prefix}/test.csv', index=False)
ground_truth.to_csv(f's3://{bucket}/{prefix}/gt.csv', index=False)
columns.to_csv(f's3://{bucket}/{prefix}/metadata/col_info.csv', index=False)
Enter fullscreen mode Exit fullscreen mode

Set AWS variables

This step needs to be completed pretty much any time I run anything in a SageMaker notebook. These variables are what allow me to talk to AWS resources and establish I have premission to use said resources.

import sagemaker
import boto3
import pandas as pd

session = sagemaker.session.Session()
bucket = session.default_bucket()
prefix = '1_ins_dataset/raw'
Enter fullscreen mode Exit fullscreen mode

Read in Data

First I read in all the data from S3. After trying multiple options in the last post, I decided I prefer using boto3.resource. For context on how this function was created, including information on why I used the objects.filter and split functions as well as why I put the data into a dictionary, please read the pervious post.

def read_mult_txt(bucket, prefix):
    s3_resource = boto3.resource("s3")
    s3_bucket = s3_resource.Bucket(bucket)

    files = {}
    for object_summary in s3_bucket.objects.filter(Prefix=prefix):
        if (len(object_summary.key.rsplit('.')) == 2) & (len(object_summary.key.split('/')) <= 3):
            files[object_summary.key.split('/')[-1].split('.')[0]] = f"s3://{bucket}/{object_summary.key}"

    df_dict = {}
    for df_name in files.keys():
        df_dict[df_name] = pd.read_csv(files[df_name])

    return df_dict
Enter fullscreen mode Exit fullscreen mode
df_dict = read_mult_txt(bucket, prefix)
Enter fullscreen mode Exit fullscreen mode

Combine data

As discussed earlier, when working on a real world dataset the data isn't generally split out into test and train for you. I want to account for this real world fact in my SageMaker Pipeline, so I need to add the test data labels to the test data, then join that with the training data. This will allow me to address when and how to split my train/test/validate data for myself.

Note, the original split only divided the data into train and test datasets. I'll be splitting the data into train, test, and validate. This will allow me to reserve data to evaluate my final model (validate data) after training (training data) and optimizing hyperparameters (test data). Splitting the data myself allows me to make these kinds of decisions.

In order to join all of the data into a single dataset, I need to know what columns are where. The training dataset will be my template because it already holds all of the data. Standard practice is to include the target/label as either the first column or the last. I can determine what columns are what by bringing in the data dictionary and associating the column names to the data.

Data Dictionary

Bringing in the column names from the data dictionary isn't as easy as it sounds with this particular dataset. The column names are in a text file that isn't conducive to a dataframe. By reading it in as a table, I can get something usable. However, upon pulling it in I found that there is other information in addition to the column names.

pd.set_option('display.max_rows', 50)

col_info_uri = f"s3://{bucket}/{prefix}/metadata/col_info.csv"
data_info = pd.read_table(col_info_uri)
data_info
Enter fullscreen mode Exit fullscreen mode
DATA DICTIONARY
0 Nr Name Description Domain
1 1 MOSTYPE Customer Subtype see L0
2 2 MAANTHUI Number of houses 1 – 10
3 3 MGEMOMV Avg size household 1 – 6
4 4 MGEMLEEF Avg age see L1
... ...
165 5 f 500 – 999
166 6 f 1000 – 4999
167 7 f 5000 – 9999
168 8 f 10.000 - 19.999
169 9 f 20.000 - ?

170 rows × 1 columns

For brevity, I only returned the top and bottom five rows of data. But when examining the full set the data dictionary clearly contains 170 rows which is comprised of five different sets of data. The information includes headers for several of the datasets and a dataset name to separate the different sections of information.

To make it more usable, I'm going to clean this up by separating the datasets, splitting out columns, and giving them their appropriate headers. I manually reviewed the data for the row indexing to separate out the different datasets.

Next I split the rows into the appropriate number of columns. Fortunately, most of this can be done by splitting on the space character and limiting the number of splits. If the column names had been more than one word long, the solution would have been more complicated. However, a simple string split gets me what I need and the column names are isolated in data_dict['feat_info']['Name'].

data_dict = {}
data_dict['feat_info'] = data_info.iloc[1:87, 0].str.split(n=2, expand=True)
data_dict['feat_info'].columns = data_info.iloc[0, 0].split(maxsplit=2)

data_dict['L0'] = data_info.iloc[89:130, 0].str.split(n=1, expand=True)
data_dict['L0'].columns = data_info.iloc[88, 0].split()

data_dict['L1'] = data_info.iloc[131:137, 0].str.split(n=1, expand=True)
data_dict['L1'].columns = ['Value', 'Bin']

data_dict['L2'] = data_info.iloc[138:148, 0].str.split(n=1, expand=True)
data_dict['L2'].columns = ['Value', 'Bin']

data_dict['L3'] = data_info.iloc[149:159, 0].str.split(n=1, expand=True)
data_dict['L3'].columns = ['Value', 'Bin']

data_dict['L4'] = data_info.iloc[160:, 0].str.split(n=1, expand=True)
data_dict['L4'].columns = ['Value', 'Bin']

for key in data_dict.keys():
    print(key)
    display(data_dict[key].head())
Enter fullscreen mode Exit fullscreen mode
feat_info
Enter fullscreen mode Exit fullscreen mode
Nr Name Description Domain
1 1 MOSTYPE Customer Subtype see L0
2 2 MAANTHUI Number of houses 1 – 10
3 3 MGEMOMV Avg size household 1 – 6
4 4 MGEMLEEF Avg age see L1
5 5 MOSHOOFD Customer main type see L2
L0
Enter fullscreen mode Exit fullscreen mode
Value Label
89 1 High Income, expensive child
90 2 Very Important Provincials
91 3 High status seniors
92 4 Affluent senior apartments
93 5 Mixed seniors
L1
Enter fullscreen mode Exit fullscreen mode
Value Bin
131 1 20-30 years
132 2 30-40 years
133 3 40-50 years
134 4 50-60 years
135 5 60-70 years
L2
Enter fullscreen mode Exit fullscreen mode
Value Bin
138 1 Successful hedonists
139 2 Driven Growers
140 3 Average Family
141 4 Career Loners
142 5 Living well
L3
Enter fullscreen mode Exit fullscreen mode
Value Bin
149 0 0%
150 1 1 - 10%
151 2 11 - 23%
152 3 24 - 36%
153 4 37 - 49%
L4
Enter fullscreen mode Exit fullscreen mode
Value Bin
160 0 f 0
161 1 f 1 – 49
162 2 f 50 – 99
163 3 f 100 – 199
164 4 f 200 – 499

Label column

Now I can determine which column holds the target label. This is the column I'll be trying to predict on for this machine learning problem. A peek at the first and last few rows gives me the column name and description (reminder, standard practice is to put the label as the first or last column).

display(data_dict['feat_info'].head(3))
display(data_dict['feat_info'].tail(3))
Enter fullscreen mode Exit fullscreen mode
Nr Name Description Domain
1 1 MOSTYPE Customer Subtype see L0
2 2 MAANTHUI Number of houses 1 – 10
3 3 MGEMOMV Avg size household 1 – 6
Nr Name Description Domain
84 84 AINBOED Number of property insurance policies
85 85 ABYSTAND Number of social security insurance policies
86 86 CARAVAN Number of mobile home policies 0 - 1

* Quick tip *

Did you forget what the dataframes within the dictionary are called? I forget what I call things all the time. You could spend all day scrolling around your notebook looking them back up, or you could just call the .keys() method. It's much easier to delete a Jupyter cell than constantly scroll around the notebook trying to find where you named something, then trying to return to where you're currently working.

If you've forgotten the name of a variable but can recall how it starts, you can type the first few letters then hit "Tab". Jupyter will auto-complete the variable name. If there is more than one variable that starts with those letters, it will give you a list to choose from.

# df_dict.keys()
Enter fullscreen mode Exit fullscreen mode
dict_keys(['gt', 'test', 'train'])
Enter fullscreen mode Exit fullscreen mode

A review of the UCI Repo page for the target variable shows the following:

The training set contains over 5000 descriptions of customers, including the information of whether or not they have a caravan insurance policy. A test set contains 4000 customers of whom only the organisers know if they have a caravan insurance policy.

Combining this information with the column names and descriptions, I now know that the target variable is the last column CARAVAN, which according to its description is an indicator of whether or not the customer has policies with the insurance company.

Column headers

The column headers of all the datasets are just the column index. This wouldn't be a problem except that when I concatenated the target label onto the test data, it brought with it the column name 0, which is already taken. I now have two columns labelled 0.

test_df = pd.concat([df_dict['test'], df_dict['gt']], axis=1)
test_df.head()
Enter fullscreen mode Exit fullscreen mode
0 1 2 3 4 5 6 7 8 9 ... 76 77 78 79 80 81 82 83 84 0
0 33 1 4 2 8 0 6 0 3 5 ... 0 0 0 1 0 0 0 0 0 0
1 6 1 3 2 2 0 5 0 4 5 ... 0 0 0 1 0 0 0 0 0 1
2 39 1 3 3 9 1 4 2 3 5 ... 0 0 0 1 0 0 0 0 0 0
3 9 1 2 3 3 2 3 2 4 5 ... 0 0 0 1 0 0 0 0 0 0
4 31 1 2 4 7 0 2 0 7 9 ... 0 0 0 1 0 0 0 0 0 0

5 rows × 86 columns

In order to concatenate the new test dataset with the training dataset, I may as well rename the columns with their appropriate column names now. It will make concatenating the test data onto the training data much easier. This is easily done by turning the Name column from the feat_info dataset in the data_dict dictionary into a list, then assigning that to my dataframe columns.

data_dict['feat_info']['Name'].to_list()[:5]
Enter fullscreen mode Exit fullscreen mode
['MOSTYPE', 'MAANTHUI', 'MGEMOMV', 'MGEMLEEF', 'MOSHOOFD']
Enter fullscreen mode Exit fullscreen mode
df_dict['train'].columns = data_dict['feat_info']['Name'].to_list()
test_df.columns = data_dict['feat_info']['Name'].to_list()
Enter fullscreen mode Exit fullscreen mode

Now the test and train dataframes have the appropriate column names assigned as headers.

test_df.head(3)
Enter fullscreen mode Exit fullscreen mode
MOSTYPE MAANTHUI MGEMOMV MGEMLEEF MOSHOOFD MGODRK MGODPR MGODOV MGODGE MRELGE ... APERSONG AGEZONG AWAOREG ABRAND AZEILPL APLEZIER AFIETS AINBOED ABYSTAND CARAVAN
0 33 1 4 2 8 0 6 0 3 5 ... 0 0 0 1 0 0 0 0 0 0
1 6 1 3 2 2 0 5 0 4 5 ... 0 0 0 1 0 0 0 0 0 1
2 39 1 3 3 9 1 4 2 3 5 ... 0 0 0 1 0 0 0 0 0 0

3 rows × 86 columns

Since the number of columns and the column headers match, it is a matter of a single line to concatenate these datasets into a single dataframe.

df = pd.concat([df_dict['train'], test_df], ignore_index=True)
df
Enter fullscreen mode Exit fullscreen mode
MOSTYPE MAANTHUI MGEMOMV MGEMLEEF MOSHOOFD MGODRK MGODPR MGODOV MGODGE MRELGE ... APERSONG AGEZONG AWAOREG ABRAND AZEILPL APLEZIER AFIETS AINBOED ABYSTAND CARAVAN
0 33 1 3 2 8 0 5 1 3 7 ... 0 0 0 1 0 0 0 0 0 0
1 37 1 2 2 8 1 4 1 4 6 ... 0 0 0 1 0 0 0 0 0 0
2 37 1 2 2 8 0 4 2 4 3 ... 0 0 0 1 0 0 0 0 0 0
3 9 1 3 3 3 2 3 2 4 5 ... 0 0 0 1 0 0 0 0 0 0
4 40 1 4 2 10 1 4 1 4 7 ... 0 0 0 1 0 0 0 0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
9817 33 1 2 4 8 0 7 2 0 5 ... 0 0 0 1 0 0 0 0 0 0
9818 24 1 2 3 5 1 5 1 3 4 ... 0 0 0 1 0 0 0 0 0 1
9819 36 1 2 3 8 1 5 1 3 7 ... 0 0 0 1 0 0 0 1 0 0
9820 33 1 3 3 8 1 4 2 3 7 ... 0 0 0 0 0 0 0 0 0 0
9821 8 1 2 3 2 4 3 0 3 5 ... 0 0 0 1 0 0 0 0 0 0

9822 rows × 86 columns

Once the data has all been joined into a single dataframe, I want to run some basic "sniff" tests on it. This allows me to be confident that there are no errors in the code that introduce incorrect data into my dataset. The only thing I'm really interested in evaluating for this merge is the number of rows and columns.

I know that the number of columns should be the same as the number of rows in the Name column. At 86 columns, this is correct.

df.shape
Enter fullscreen mode Exit fullscreen mode

To get the correct number of rows I first reviewed the UCI Repo page. It says there are "5000 descriptions of customers" in the training dataset and "4000 customers" in the test dataset. Which means there should be 9,000 rows in the combined dataset. This number doesn't match the number of rows listed in my dataframe, which is 9,822, almost 1,000 more rows than expected.

However, in looking at the shape of my original train and test datasets, I can see that the numbers on the UCI page for the training data was an estimate instead of an exact figure.

* Quick Tip *

It never hurts to double check. However, it sure can hurt if you don't.

print('Training data has', df_dict['train'].shape[0], 'rows')
print('Test data has', df_dict['test'].shape[0], 'rows')
Enter fullscreen mode Exit fullscreen mode
Training data has 5822 rows
Test data has 4000 rows
Test data has 4000 rows
Enter fullscreen mode Exit fullscreen mode

Column names

The provided column names are very cryptic. The variables that start with 'M' in this dataset are a perfect example. Per the Dataset Information:

Note: All the variables starting with M are zipcode variables. They give information on the distribution of that variable, e.g. Rented house, in the zipcode area of the customer.

I.E. These variables are aggregated data enrichments and don't necessarily directly reflect the customer in question. This type of information could easily lead to bias within a model. While this dataset is for direct marketing purposes, and thus sterotyping isn't necessarily a big deal, if this information were used to assess the insurability of a customer or cost of their policy, there could be direct consequences that unfairly impact customers.

Based on these types of considerations, I prefer column names that let me quickly understand what the data in a column represents.

I manually copied and pasted the column descriptions and altered them to my liking. With only 86 columns, this is a feasible solution. If I had more columns, I might not be so quick to resort to manual data entry. For convenience, I've included my column names below so the copy and paste solution doesn't have to be replicated by others.

col_names = ['zip_agg_customer_subtype',
             'zip_agg_number_of_houses',
             'zip_agg_avg_size_household',
             'zip_agg_avg_age',
             'zip_agg_customer_main_type',
             'zip_agg_roman_catholic',
             'zip_agg_protestant',
             'zip_agg_other_religion',
             'zip_agg_no_religion',
             'zip_agg_married',
             'zip_agg_living_together',
             'zip_agg_other_relation',
             'zip_agg_singles',
             'zip_agg_household_without_children',
             'zip_agg_household_with_children',
             'zip_agg_high_level_education',
             'zip_agg_medium_level_education',
             'zip_agg_lower_level_education',
             'zip_agg_high_status',
             'zip_agg_entrepreneur',
             'zip_agg_farmer',
             'zip_agg_middle_management',
             'zip_agg_skilled_labourers',
             'zip_agg_unskilled_labourers',
             'zip_agg_social_class_a',
             'zip_agg_social_class_b1',
             'zip_agg_social_class_b2',
             'zip_agg_social_class_c',
             'zip_agg_social_class_d',
             'zip_agg_rented_house',
             'zip_agg_home_owners',
             'zip_agg_1_car',
             'zip_agg_2_cars',
             'zip_agg_no_car',
             'zip_agg_national_health_service',
             'zip_agg_private_health_insurance',
             'zip_agg_income_<_30.000',
             'zip_agg_income_30-45.000',
             'zip_agg_income_45-75.000',
             'zip_agg_income_75-122.000',
             'zip_agg_income_>123.000',
             'zip_agg_average_income',
             'zip_agg_purchasing_power_class',
             'contri_private_third_party_ins',
             'contri_third_party_ins_(firms)',
             'contri_third_party_ins_(agriculture)',
             'contri_car_policies',
             'contri_delivery_van_policies',
             'contri_motorcycle/scooter_policies',
             'contri_lorry_policies',
             'contri_trailer_policies',
             'contri_tractor_policies',
             'contri_agricultural_machines_policies',
             'contri_moped_policies',
             'contri_life_ins',
             'contri_private_accident_ins_policies',
             'contri_family_accidents_ins_policies',
             'contri_disability_ins_policies',
             'contri_fire_policies',
             'contri_surfboard_policies',
             'contri_boat_policies',
             'contri_bicycle_policies',
             'contri_property_ins_policies',
             'contri_ss_ins_policies',
             'nbr_private_third_party_ins',
             'nbr_third_party_ins_(firms)',
             'nbr_third_party_ins_(agriculture)',
             'nbr_car_policies',
             'nbr_delivery_van_policies',
             'nbr_motorcycle/scooter_policies',
             'nbr_lorry_policies',
             'nbr_trailer_policies',
             'nbr_tractor_policies',
             'nbr_agricultural_machines_policies',
             'nbr_moped_policies',
             'nbr_life_ins',
             'nbr_private_accident_ins_policies',
             'nbr_family_accidents_ins_policies',
             'nbr_disability_ins_policies',
             'nbr_fire_policies',
             'nbr_surfboard_policies',
             'nbr_boat_policies',
             'nbr_bicycle_policies',
             'nbr_property_ins_policies',
             'nbr_ss_ins_policies',
             'nbr_mobile_home_policies']
Enter fullscreen mode Exit fullscreen mode
df.columns = col_names
df.head()
Enter fullscreen mode Exit fullscreen mode
zip_agg Customer Subtype zip_agg Number of houses zip_agg Avg size household zip_agg Avg age zip_agg Customer main type zip_agg Roman catholic zip_agg Protestant zip_agg Other religion zip_agg No religion zip_agg Married ... Nbr private accident ins policies Nbr family accidents ins policies Nbr disability ins policies Nbr fire policies Nbr surfboard policies Nbr boat policies Nbr bicycle policies Nbr property ins policies Nbr ss ins policies Nbr mobile home policies
0 33 1 3 2 8 0 5 1 3 7 ... 0 0 0 1 0 0 0 0 0 0
1 37 1 2 2 8 1 4 1 4 6 ... 0 0 0 1 0 0 0 0 0 0
2 37 1 2 2 8 0 4 2 4 3 ... 0 0 0 1 0 0 0 0 0 0
3 9 1 3 3 3 2 3 2 4 5 ... 0 0 0 1 0 0 0 0 0 0
4 40 1 4 2 10 1 4 1 4 7 ... 0 0 0 1 0 0 0 0 0 0

5 rows × 86 columns

Reverse entineer categorical transformations

I chose this dataset because it had both numeric and categorical variables. I determined this by looking at the "Attribute Characteristics" as listed on the Insurance Company Benchmark Data Set page. It lists "Attribute Characteristics: Categorical, Integer."

Based on the initial examination, this dataset has obviously already been turned into numeric variables, thus negating the whole "includes categorical variables" aspect. This is one of the biggest reasons I don't generally use the UCI datasets: too many of them are already preprocessed and aren't representative of what I see in the real world. I cover more of the preprocessing already done to the dataset in the EDA post.

The data dictionary holds the key to which columns used to be categorical, it's a simple indicator at the end of the Description Domain column. This only applies to the features at indices 0, 3, 4, 5, and 43, as shown below.

Note, Domain should be it's own column. However, there was no super easy way to split it out and it only applies to five rows, so I didn't bother worrying about it.

data_dict['feat_info'].iloc[[0, 3, 4, 5, 43], :]
Enter fullscreen mode Exit fullscreen mode
Nr Name Description Domain
1 1 MOSTYPE Customer Subtype see L0
4 4 MGEMLEEF Avg age see L1
5 5 MOSHOOFD Customer main type see L2
6 6 MGODRK Roman catholic see L3
44 44 PWAPART Contribution private third party insurance see L4

The L0 - L4 datasets that are included in the Data Dictionary are starting to make more sense. They're the textual representation of the categorical features.

for key in ['L0', 'L1', 'L2', 'L3', 'L4']:
    print(key)
    display(data_dict[key].head())
Enter fullscreen mode Exit fullscreen mode
L0
Enter fullscreen mode Exit fullscreen mode
Value Label
89 1 High Income, expensive child
90 2 Very Important Provincials
91 3 High status seniors
92 4 Affluent senior apartments
93 5 Mixed seniors
L1
Enter fullscreen mode Exit fullscreen mode
Value Bin
131 1 20-30 years
132 2 30-40 years
133 3 40-50 years
134 4 50-60 years
135 5 60-70 years
L2
Enter fullscreen mode Exit fullscreen mode
Value Bin
138 1 Successful hedonists
139 2 Driven Growers
140 3 Average Family
141 4 Career Loners
142 5 Living well
L3
Enter fullscreen mode Exit fullscreen mode
Value Bin
149 0 0%
150 1 1 - 10%
151 2 11 - 23%
152 3 24 - 36%
153 4 37 - 49%
L4
Enter fullscreen mode Exit fullscreen mode
Value Bin
160 0 f 0
161 1 f 1 – 49
162 2 f 50 – 99
163 3 f 100 – 199
164 4 f 200 – 499

For brevity, I only printed the head of each of the Lx datasets. From this information I can determine that only two of the categorical features make sense to return to their textual representation.

L0 and L2 can be mapped back to their original text. I can one-hot-encode these variables in my SageMaker Pipeline. L1, L3, and L4 are binned representations of the original values. There is very little benefit to mapping these back to the textual representation, so I'll leave them alone.

Replacing values in a dataframe can be as simple as using a dictionary to map old value to new value. I can create this dictionary using the L0 and L2 dataframes. The Value column corresponds with the numeric representation in the dataset. The Label and Bin columns correspond with the original text. The only catch was to ensure that the Value column was numeric (to match the datatype in the dataframe) and to set Value to the index so it appropriately translated as the key in the dictionary.

data_dict['L0']['Value'] = pd.to_numeric(data_dict['L0']['Value'])
l0_dict = data_dict['L0'].set_index('Value').to_dict()['Label']
l0_dict
Enter fullscreen mode Exit fullscreen mode
{1: 'High Income, expensive child',
 2: 'Very Important Provincials',
 3: 'High status seniors',
 4: 'Affluent senior apartments',
 5: 'Mixed seniors',
 6: 'Career and childcare',
 7: "Dinki's (double income no kids)",
 8: 'Middle class families',
 9: 'Modern, complete families',
 10: 'Stable family',
 11: 'Family starters',
 12: 'Affluent young families',
 13: 'Young all american family',
 14: 'Junior cosmopolitan',
 15: 'Senior cosmopolitans',
 16: 'Students in apartments',
 17: 'Fresh masters in the city',
 18: 'Single youth',
 19: 'Suburban youth',
 20: 'Etnically diverse',
 21: 'Young urban have-nots',
 22: 'Mixed apartment dwellers',
 23: 'Young and rising',
 24: 'Young, low educated ',
 25: 'Young seniors in the city',
 26: 'Own home elderly',
 27: 'Seniors in apartments',
 28: 'Residential elderly',
 29: 'Porchless seniors: no front yard',
 30: 'Religious elderly singles',
 31: 'Low income catholics',
 32: 'Mixed seniors',
 33: 'Lower class large families',
 34: 'Large family, employed child',
 35: 'Village families',
 36: "Couples with teens 'Married with children'",
 37: 'Mixed small town dwellers',
 38: 'Traditional families',
 39: 'Large religous families',
 40: 'Large family farms',
 41: 'Mixed rurals'}
Enter fullscreen mode Exit fullscreen mode
data_dict['L2']['Value'] = pd.to_numeric(data_dict['L2']['Value'])
l2_dict = data_dict['L2'].set_index('Value').to_dict()['Bin']
l2_dict
Enter fullscreen mode Exit fullscreen mode
{1: 'Successful hedonists',
 2: 'Driven Growers',
 3: 'Average Family',
 4: 'Career Loners',
 5: 'Living well',
 6: 'Cruising Seniors',
 7: 'Retired and Religeous',
 8: 'Family with grown ups',
 9: 'Conservative families',
 10: 'Farmers'}
Enter fullscreen mode Exit fullscreen mode
display(data_dict['feat_info'].iloc[[0, 4], :])
print('L0:', df.columns[0])
print('L2:', df.columns[4])
Enter fullscreen mode Exit fullscreen mode
Nr Name Description Domain
1 1 MOSTYPE Customer Subtype see L0
5 5 MOSHOOFD Customer main type see L2
L0: zip_agg Customer Subtype
L2: zip_agg Customer main type
Enter fullscreen mode Exit fullscreen mode

With the mapping dictionaries specified, all I need now is to use the .replace() method on the appropriate dataframe column.

df[df.columns[0]] = df[df.columns[0]].replace(l0_dict)
df[df.columns[4]] = df[df.columns[4]].replace(l2_dict)
Enter fullscreen mode Exit fullscreen mode
df.head()
Enter fullscreen mode Exit fullscreen mode
zip_agg Customer Subtype zip_agg Number of houses zip_agg Avg size household zip_agg Avg age zip_agg Customer main type zip_agg Roman catholic zip_agg Protestant zip_agg Other religion zip_agg No religion zip_agg Married ... Nbr private accident ins policies Nbr family accidents ins policies Nbr disability ins policies Nbr fire policies Nbr surfboard policies Nbr boat policies Nbr bicycle policies Nbr property ins policies Nbr ss ins policies Nbr mobile home policies
0 Lower class large families 1 3 2 Family with grown ups 0 5 1 3 7 ... 0 0 0 1 0 0 0 0 0 0
1 Mixed small town dwellers 1 2 2 Family with grown ups 1 4 1 4 6 ... 0 0 0 1 0 0 0 0 0 0
2 Mixed small town dwellers 1 2 2 Family with grown ups 0 4 2 4 3 ... 0 0 0 1 0 0 0 0 0 0
3 Modern, complete families 1 3 3 Average Family 2 3 2 4 5 ... 0 0 0 1 0 0 0 0 0 0
4 Large family farms 1 4 2 Farmers 1 4 1 4 7 ... 0 0 0 1 0 0 0 0 0 0

5 rows × 86 columns

Save data

The last step is to save all my hard work back to S3.

df.to_csv(f's3://{bucket}/{prefix}/full.csv', index=False)
Enter fullscreen mode Exit fullscreen mode

Delete Files

To ensure no ongoing charges are charged to your account, you can delete the files from S3.

folder_prefix = '1_ins_dataset/'
s3_resource = boto3.resource("s3")
Enter fullscreen mode Exit fullscreen mode
s3_bucket = s3_resource.Bucket(bucket)
s3_bucket.objects.filter(Prefix=folder_prefix).delete()
Enter fullscreen mode Exit fullscreen mode
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .