DataCleaningChallenge: Day 2

It's Day 2 of the challenge, and the heat is up! Many participants of this challenge (myself included) can agree that this is one Messi dataset. Get it?

It has been quite challenging trying to clean up such a large dataset. You're thinking, "There are so many things wrong with this dataset," and at the same time, "Where do I start?"

To overcome this seemingly insurmountable messiness, I have decided to take a different approach to my cleaning process.

I got this idea from the ALX-Udacity Data Analyst Nanodegree program I took last year and it's been very helpful.

So basically, the idea is to break the whole process into two: assessing first, and then cleaning. This is a very efficient method and that's how I've chosen to approach the challenge.

So today, I took my time to carefully assess the data, visually and programmatically.

The data loaded into my Jupyter Notebook didn't show all rows and columns at once, so I decided to use Excel for the assessment. However, Excel didn't display the names properly because of the non-English characters, but that doesn't matter since I'll be doing the actual cleaning in Python. The Excel filter feature made it easy to glance through the unique values in each column to detect anomalies.
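(As an aside, if you'd rather stay in the notebook, pandas' display limits can be lifted. This is just a minimal sketch; the file name is a placeholder for wherever you saved the raw dataset.)

```python
import pandas as pd

# Placeholder file name -- point this at your copy of the raw dataset
df = pd.read_csv("fifa21_raw_data.csv")

# Lift pandas' default truncation so every column is visible in the notebook
pd.set_option("display.max_columns", None)
pd.set_option("display.width", None)

df.head()
```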

Having filtered the dataset, I examined each column to check for irregularities, and I found quite a lot of issues. I will be classifying these issues based on the dimensions of data quality (I mentioned these in my previous post).

The data quality dimensions are:

  1. Completeness: The following questions are asked: Are there missing values in the dataset? Are there duplicates? There are no duplicates, as the check below confirms.

    From the assessment done yesterday, we saw that two columns had missing data: Loan Date End and Hits. To avoid issues with operations during analysis, it would be best to handle the missing values in these columns instead of leaving them blank.

Problem 1: The Loan Date End and Hits columns have missing values.
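The same completeness checks can be done programmatically. A minimal sketch, assuming the data is loaded as df and the columns are named as above:

```python
# Duplicates: expect 0 full-row duplicates
print(df.duplicated().sum())

# Missing values: expect Loan Date End and Hits to show up here
missing = df.isnull().sum()
print(missing[missing > 0])
```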

2. Validity: This checks if the records conform to a defined schema.

Problem 2: The column named >OVA is incorrectly named.

This is confirmed by the data dictionary.
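A quick, rough way to spot oddly named headers is to scan the column list for anything that isn't letters, digits and spaces:

```python
# Flag column headers containing characters other than letters, digits and spaces
odd_headers = [col for col in df.columns if not col.replace(" ", "").isalnum()]
print(odd_headers)  # should surface the >OVA column, among any others
```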

Problem 3: The Club column contains \n\n before the club name.
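To confirm the stray newlines (assuming the column is literally named Club):

```python
# Leading '\n' characters show up when the raw strings are printed as a list
print(df["Club"].head().tolist())

# Count how many rows are affected
print(df["Club"].str.startswith("\n").sum())
```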

Problem 4: The Contract column is inconsistent. It mixes two different variables in one column, and the same parameter is recorded in two different formats.

I noticed that some players were on a free contract, which seemed odd. So I filtered to see just those players and found that the players on free contracts also had no clubs, which made more sense.
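Here's a sketch of that filtering step in pandas, assuming the 'Free' label and the Name/Club/Contract headers used above:

```python
# Inspect a handful of raw Contract strings to see the mixed formats
print(df["Contract"].sample(10, random_state=1))

# Players whose contract is recorded as 'Free' -- do they also lack a club?
free_agents = df[df["Contract"].str.contains("Free", na=False)]
print(free_agents[["Name", "Club", "Contract"]].head())
```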

Problem 5: The Height column is invalid because of the presence of units (cm and inches).

Problem 6: The Weight column is invalid because of the presence of units (kg and lbs).

Problem 7: The columns indicating dates, such as Contract, Loan Date End and Joined, have the wrong data type.

Problem 8: The Value and Wage columns contain the currency unit (euro) as well as suffixes (K and M) indicating thousands and millions.

Problem 9: The Value, Wage, Height and Weight columns are of the wrong data type.

Problem 10: The Hits column contains the suffix K, which represents thousands. It also has the wrong data type.
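Problems 7 to 10 are easy to confirm with a dtype check and a peek at the raw values (column names assumed from the headers above):

```python
# Everything here arrives as plain strings (object dtype) rather than dates or numbers
cols = ["Joined", "Loan Date End", "Value", "Wage", "Hits", "Height", "Weight"]
print(df[cols].dtypes)

# The raw values still carry euro signs, K/M suffixes and cm/kg units
print(df[cols].head())
```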

  3. Accuracy: This checks the data for correctness. A record can be valid but still inaccurate. For example, a recorded weight of 2kg may be valid but is inaccurate, since no player can be as light as 2kg.

Problem 11: The Wage column contains values that are too low to be accurate. For example, a player earning 800 euros is questionable. This could be the result of omitted suffixes. More research is needed here.
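One rough way to isolate the suspicious rows is to look at Wage strings that carry no K or M suffix:

```python
# Wages without a K or M suffix are the implausibly low ones
low_wage = df[~df["Wage"].str.contains("K|M", na=False)]
print(low_wage[["Name", "Club", "Wage"]].head())
print(len(low_wage))  # how many players fall into this bucket
```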

  4. Consistency: This checks if the same data is represented in the same way across the dataset.

    Problem 12: The Height column is recorded in both cm and inches.

Problem 13: The Weight column is recorded in both kg and lbs.

Problem 14: The Value and Wage columns are recorded in both K and M.

Problem 15: The Hits column records values as both integers and strings (e.g. 1.6K).
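Counting how often each format appears makes the inconsistency concrete. A rough sketch (it assumes metric heights end in "cm", metric weights end in "kg", and Hits uses a "K" suffix):

```python
# How many rows use each unit or format?
print(df["Height"].str.endswith("cm").value_counts())  # True = cm, False = feet and inches
print(df["Weight"].str.endswith("kg").value_counts())  # True = kg, False = lbs
print(df["Hits"].str.contains("K", na=False).sum())    # entries recorded like '1.6K'
```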

I decided to drop the Name column and simply use the LongName column, then rename LongName to Name. I have no idea which dimension to place this under, though.
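If I go that route, the pandas one-liner would look roughly like this (column names assumed to be Name and LongName):

```python
# Drop the short name and promote the full name to 'Name'
df = df.drop(columns=["Name"]).rename(columns={"LongName": "Name"})
```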

And that's all for Day 2. Wasn't that exciting? Just like examining a murder scene, looking for clues. That's how cool cleaning data feels to me.

Tomorrow, we'll tackle each of these issues, and who knows, we might even find more. Stay tuned!
