White Wine Dataset Exploratory Data Analysis, by Emad Takla

{r global_options, include=FALSE} knitr::opts_chunk$set(echo=FALSE, warning=FALSE, message=FALSE)

Getting to Know the Data

Size

The dataset has 4898 entries and 13 columns: the ID of the wine, 11 variables describing the wine and, finally, the wine’s score as rated by at least 3 “wine experts”.

Summary

Summary of Each Parameter
| variable | mean | sd | median | min | max | n |
|---|---|---|---|---|---|---|
| X | 2449.5000000 | 1414.0751394 | 2449.50000 | 1.00000 | 4898.00000 | 4898 |
| fixed.acidity | 6.8547877 | 0.8438682 | 6.80000 | 3.80000 | 14.20000 | 4898 |
| volatile.acidity | 0.2782411 | 0.1007945 | 0.26000 | 0.08000 | 1.10000 | 4898 |
| citric.acid | 0.3341915 | 0.1210198 | 0.32000 | 0.00000 | 1.66000 | 4898 |
| residual.sugar | 6.3914149 | 5.0720578 | 5.20000 | 0.60000 | 65.80000 | 4898 |
| chlorides | 0.0457724 | 0.0218480 | 0.04300 | 0.00900 | 0.34600 | 4898 |
| free.sulfur.dioxide | 35.3080849 | 17.0071373 | 34.00000 | 2.00000 | 289.00000 | 4898 |
| total.sulfur.dioxide | 138.3606574 | 42.4980646 | 134.00000 | 9.00000 | 440.00000 | 4898 |
| density | 0.9940274 | 0.0029909 | 0.99374 | 0.98711 | 1.03898 | 4898 |
| pH | 3.1882666 | 0.1510006 | 3.18000 | 2.72000 | 3.82000 | 4898 |
| sulphates | 0.4898469 | 0.1141258 | 0.47000 | 0.22000 | 1.08000 | 4898 |
| alcohol | 10.5142670 | 1.2306206 | 10.40000 | 8.00000 | 14.20000 | 4898 |
| quality | 5.8779094 | 0.8856386 | 6.00000 | 3.00000 | 9.00000 | 4898 |
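The six statistics in the summary can be reproduced for any single column with a few lines of code. The sketch below is only an illustration in Python: the function name `summarize` and the sample readings are hypothetical, and it assumes the `sd` column is the sample standard deviation (the n - 1 form, which is what R's `sd()` computes).

```python
import statistics


def summarize(values):
    """Return the six statistics used in the summary: mean, sd, median, min, max, n."""
    return {
        "mean": statistics.mean(values),
        "sd": statistics.stdev(values),  # sample standard deviation (n - 1), assumed
        "median": statistics.median(values),
        "min": min(values),
        "max": max(values),
        "n": len(values),
    }


# Hypothetical pH readings, for illustration only (not rows taken from the dataset).
sample_ph = [3.0, 3.3, 3.26, 3.19, 3.18]
summary = summarize(sample_ph)
print(summary)
```

Applied to each column of the real data, the same function should give one row of the summary per variable.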

Univariate Plots Section

Fixed Acidity

This one is measured by the tartaric acid’s concentration within the wine. These acids do not evaporate easily. We have 0 missing values.
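The missing-value count quoted for each variable can be checked with a small helper. This is a minimal sketch, assuming the data is read from CSV text and that a missing cell shows up as an empty field or the string `NA`; the inline excerpt and the name `count_missing` are hypothetical, and the empty pH field is planted on purpose to show a non-zero count.

```python
import csv
import io

# Hypothetical CSV excerpt; the empty pH field on the last row is deliberate.
raw = io.StringIO(
    "fixed.acidity,pH,quality\n"
    "7.0,3.00,6\n"
    "6.3,3.30,6\n"
    "8.1,,5\n"
)


def count_missing(rows, columns):
    """Count cells that are absent, empty or 'NA' in each column."""
    missing = {name: 0 for name in columns}
    for row in rows:
        for name in columns:
            cell = row[name]
            if cell is None or cell.strip() in ("", "NA"):
                missing[name] += 1
    return missing


reader = csv.DictReader(raw)
missing_counts = count_missing(reader, reader.fieldnames)
print(missing_counts)
```

On the wine data, every column would be expected to report 0, matching the statements in this section.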

Bell shaped, a very small positive skewness.

Volatile Acidity

This one is measured by the acetic acid’s concentration within the wine. At too high a level, it causes an unpleasant, vinegar-like taste. We have 0 missing values.

Bell shaped, but with a more pronounced skewness here, a positive one: the mean (0.278) sits above the median (0.260) and the maximum reaches 1.10.

Citric Acid

Found in small quantities, citric acid can add freshness and flavor to wine. We have 0 missing values.

Also bell shaped, a little skewed to the left.

Residual Sugar

The amount of sugar remaining after fermentation stops. It is rare to find wines with less than 1 gram/liter, and wines with more than 45 grams/liter are considered sweet. We have 0 missing values.

A bimodal shape here, with an extremely high spike at the start (dry wine, with little or no residual sugar), then another smaller summit at roughly 8 g/dm^3.

Chlorides

The amount of salt in the wine. We have 0 missing values.

A lot of outliers in here. But the main bulk has a balanced bell shape.

Free Sulfur Dioxide

The free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine. At free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine. We have 0 missing values.

Same as chlorides, but the proportion of outliers here is a lot less.

Total Sulfur Dioxide

Amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine. We have 0 missing values.

There are even fewer outliers here than for free sulfur dioxide.

Density

The density of wine is close to that of water, depending on the percent alcohol and sugar content. We have 0 missing values.

Very few outliers, and a very narrow density range (numerically speaking). The vast majority of the wines are just a tiny bit less dense than water.

pH

Describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale. We have 0 missing values.

A classical bell shape, and all wines are within the acidic pH range, with a mean pH around 3.2.

Sulphates

A wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant. We have 0 missing values.

Positively skewed bell shape.

Alcohol

The percent alcohol content of the wine. We have 0 missing values.

The alcohol distribution has a heavy positive skewness. The peak is maybe at 9%, but the bulk of the bottles have a higher alcohol content.

Quality Score

Expert rating for the wine, on a scale from 0 to 10. We have 0 missing values.

The minimum score is 3 and the maximum is 9. The most common score is 6, but more bottles score below 6 than above it.

Univariate Analysis

What is the structure of your dataset?

The dataset has 11 variables describing the wine chemically, and one last variable for the wine’s quality as perceived by wine experts, graded from 0 (very bad) to 10 (highest quality). Each wine was evaluated by at least 3 experts. As declared: “Due to privacy and logistic issues, only physicochemical (inputs) and sensory (the output) variables are available (e.g. there is no data about grape types, wine brand, wine selling price, etc.).”

I honestly would have wished to have the price tag as well.

What is/are the main feature(s) of interest in your dataset?

From the univariate analysis, no variable really stands out. All of them have a more or less bell-shaped distribution, with different levels of skewness. Intuitively though, the most important variable would be the quality.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

I will wait for the correlation calculation to see which variables are most related to quality. I would suspect that acidity, sulfur contents and alcohol level would be important factors.

Did you create any new variables from existing variables in the dataset?

Not yet

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

Alcohol was positively skewed in a very noticeable way, while quality was slightly negatively skewed. There are not many outliers in the data.
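The skewness remarks in this report are read off the histograms by eye. A numeric check is possible with the moment-based coefficient of skewness (third central moment divided by the second central moment raised to the power 1.5). The sketch below is an illustration in Python on made-up samples; `skewness` is a hypothetical helper, not part of the original analysis.

```python
import statistics


def skewness(values):
    """Moment-based skewness: m3 / m2**1.5, using population moments."""
    n = len(values)
    mean = statistics.fmean(values)
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Made-up samples: a long right tail, its mirror image, and a symmetric one.
right_tailed = [1, 1, 2, 2, 2, 3, 9]
left_tailed = [-v for v in right_tailed]
symmetric = [1, 2, 3, 4, 5]

print(skewness(right_tailed), skewness(left_tailed), skewness(symmetric))
```

A positive result means a longer tail towards high values, a negative result a longer tail towards low values, and a value near zero a roughly symmetric shape.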

Bivariate Plots Section

General Overview

These graphs bring a lot of insight. From the quality standpoint, the strongest correlation exists with the alcohol level. It is not surprising then that the strongest negative correlation with quality is that of the wine's density, since the alcohol level is a strong factor for density: the more alcohol we have, the lower the density.
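The correlations discussed throughout this section are taken here to be Pearson coefficients (an assumption, since the report does not name the method). For reference, this is a small Python sketch of that coefficient; the function name `pearson` and the five data points are hypothetical and only mimic the direction of the relationships described above.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of the same length (at least 2)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical bottles: more alcohol, lower density, slightly better score.
alcohol = [9.0, 10.0, 11.0, 12.0, 13.0]
density = [0.998, 0.996, 0.995, 0.992, 0.990]
quality = [5, 5, 6, 6, 7]

print(pearson(alcohol, quality), pearson(alcohol, density))
```

The coefficient only captures linear association, which is one reason the categorised plots later in the report are a useful complement.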

Density

Density seems to be affected by a lot of factors. It also seems to be an important factor for quality, having an inverse relationship with it. The top elements affecting density are:

  1. Residual Sugar

  2. Alcohol

  3. Total Sulfur Dioxide

Density vs Residual Sugar

Density vs Alcohol

Density vs Total Sulfur Dioxide

Quality

Enhancing Factor

By correlation, alcohol content is the only variable with a notable positive correlation with quality.

Degrading Factors

Density

Density seems to be the strongest indicator of quality (or rather the lack of it, since the relationship is inversely proportional). It is tempting to see this as just a reflection of the lack of alcohol, since alcohol is correlated with quality and reduces the wine’s density at the same time. On the other hand, density is also affected by other factors, and so it should be a good parameter to observe:

Other Factors

The top three degrading variables (by correlation) are Volatile Acidity, Chlorides and Total Sulfur Dioxide.

Volatile Acidity:

Salts:

Total Sulfur Dioxide:

Total of the three elements

The two factors that I would like to further explore are the quality and alcohol content. These will be categorized each into three categories. For the quality, the three categories are:

  • Low Quality (Scores 3 and 4)

  • Medium Quality (Scores 5, 6 and 7)

  • High Quality (Scores 8 and 9)

For the alcohol contents, the categories are:

  • Low Alcohol (Under 10%)

  • Medium Alcohol (Between 10% and 12%)

  • High Alcohol (Over 12%)

These divisions are arbitrary; they are not based on how experts would categorize wine. The boundaries were chosen to balance the number of bottles across the categories. When I looked it up online, a high alcohol wine would be something above 14.5%, for example, which would make such a categorisation useless here since we do not have any bottles with such a high alcohol level (the highest here is 14.2%).
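The two categorisations can be written down as a pair of small functions. This is a sketch in Python rather than the code used for the plots; the function names are hypothetical, and the handling of the exact boundary values (10% and 12% both counted as medium) is an assumption, since the text only says "under", "between" and "over".

```python
def quality_category(score):
    """Map a quality score to the three categories listed above."""
    if score <= 4:
        return "Low Quality"      # scores 3 and 4
    if score <= 7:
        return "Medium Quality"   # scores 5, 6 and 7
    return "High Quality"         # scores 8 and 9


def alcohol_category(percent):
    """Map alcohol content (%) to the three categories listed above."""
    if percent < 10:
        return "Low Alcohol"
    if percent <= 12:             # assumption: 10% and 12% count as medium
        return "Medium Alcohol"
    return "High Alcohol"


print(quality_category(6), alcohol_category(10.4))
```

For example, a bottle scored 6 with 10.4% alcohol falls into the medium group on both axes.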

Alcohol, a Categorised Perspective

Residual Sugar

Sugar contents have an inverse correlation with the alcohol content.

Total Sulfur Dioxide

The lower the alcohol, the higher the sulfur content. Sulfur dioxide is a preservative, and so is alcohol. Maybe it is added when the alcohol level is low so that the wine stays fresh for a longer time?

Quality, a Categorized Perspective

Saltiness and Quality

Not too surprising, but the trend here is: the less salt, the better the taste.

Acidity Effect Over Quality

The two graphs above give an interesting insight into the effect of volatile acidity on wine. A high acid content is a clear trend among bad wines, but a lack of acidity is not an indicator of quality. So low acid does not indicate anything, but high acid would probably mean an inferior quality of the wine.

Citric Acid and Quality

I have the impression that the citric acid content has a narrow window for the right amount. Too low or too high a level (especially too low) is more common in low quality wines. It is striking how the medium and high quality wines have similar shapes, around the same concentration.

How Alcohol Categories are Distributed Over Quality Score?

The strong correlation between quality and alcohol is indeed the most interesting in my opinion so far. Let us explore the make up of each quality score in terms of wine alcohol category:

I think this plot is an excellent visualization of the high correlation between alcohol amount and quality. Despite being a minority, high alcohol wines dominate the highest scores. Moreover, the share of low alcohol bottles gradually decreases as the score gets better.

Medium alcohol is distributed all over, making up almost half the bottles for each score, except for the highest score. But since we have a lot more medium alcohol bottles than the two other categories, I will try to see the distribution of each type of alcohol content over all the scores. The y axis will represent the percentage within each category independently, not as a percentage of all the bottles. This normalization should reduce the effect of the different size of each alcohol category.
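The normalization described above amounts to dividing each (category, score) count by the size of its own category. A minimal Python sketch follows, with a hypothetical helper `share_by_category` and a handful of made-up bottles; it is not the plotting code behind the figure.

```python
from collections import Counter, defaultdict


def share_by_category(pairs):
    """Given (category, score) pairs, return {category: {score: percent of that category}}."""
    counts = defaultdict(Counter)
    for category, score in pairs:
        counts[category][score] += 1
    shares = {}
    for category, score_counts in counts.items():
        total = sum(score_counts.values())
        shares[category] = {
            score: 100.0 * count / total
            for score, count in sorted(score_counts.items())
        }
    return shares


# Made-up bottles: (alcohol category, quality score).
bottles = [
    ("Low Alcohol", 5), ("Low Alcohol", 5), ("Low Alcohol", 6), ("Low Alcohol", 4),
    ("High Alcohol", 6), ("High Alcohol", 7),
]
shares = share_by_category(bottles)
print(shares)
```

Because every category sums to 100%, a small category such as high alcohol can be compared directly with a much larger one.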

The distribution looks like a bell shape for all the alcohol categories, even for the high alcohol ones. High alcohol is slightly leaning (skewed) towards the higher scores, and medium alcohol is slightly leaning towards the lower scores. The only decisive distribution is that of the low alcohol bottles, which is heavily concentrated towards the lower scores and has its peak at a score of 5 out of 10. The peak for medium and high alcohol is at 6/10.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

Density, although its range of variation is narrow (All, more or less, are around the density of water), had a lot of correlations with different factors. Residual sugars, alcohol contents and total sulfur dioxide were the most defining factors.

Next, quality. Generally speaking, most factors had negative correlations with quality except for the alcohol level, and this makes sense. Excessive volatile acidity would give wine a vinegar-like taste, chlorides make the wine salty and sulfur dioxide tastes like matches. So when any of these is present at a noticeable level, it would degrade the rating of the wine. For alcohol, though, I think the relationship with quality is indirect. Alcohol is a preservative, and its presence in good quantities reduces the need to add other preservatives like sulfur dioxide. But of course this is not a conclusion drawn from the data, it is just my intuition about it. The density distribution of saltiness is remarkable, with the highest quality wines concentrated at lower salt concentrations.

For acidity, all three quality categories had positive skewness. The more interesting feature of their density distributions was the kurtosis: low quality wine had the lowest kurtosis, but the other two are a bit confusing, since medium quality wine had the highest kurtosis, not high quality wine. Moreover, their peaks sat around almost the same value. That gives me the impression that neither fixed nor volatile acidity is a critical factor for quality, even though acidity had some of the highest correlations with quality (not high coefficients on an absolute scale, in fact these should be considered weak correlations, but high in comparison with the other factors).

The last, and in my opinion most interesting, acidity factor was the citric acid. Its kurtosis was really prominent, giving the impression that the right quantity lies within a narrow window, somewhere between 0.2 and a little less than 0.5 g/dm^3.

Although the three strongest degrading components - chlorides, volatile acidity and total sulfur dioxide - had correlations of -.21, -.195 and -.175 respectively, their total per volume had a stronger negative correlation (-.268). I have tried to explore further by creating different combinations of their ratios, but these did not lead to any significant correlation worth mentioning.
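The "total per volume" of the three degrading components can be sketched as a simple sum once the units agree. The Python snippet below is an assumption about how such a total could be built, not the code behind the reported -.268: it assumes volatile acidity and chlorides are in g/dm^3 and total sulfur dioxide is in mg/dm^3, so the latter is divided by 1000. The helper name `degrading_total` is hypothetical; the inputs are the dataset means from the summary of each parameter.

```python
def degrading_total(volatile_acidity, chlorides, total_so2_mg):
    """Sum of the three degrading components in g/dm^3.

    Assumption: total SO2 is given in mg/dm^3 and converted to g/dm^3 here.
    """
    return volatile_acidity + chlorides + total_so2_mg / 1000.0


# Dataset means taken from the summary of each parameter.
mean_volatile_acidity = 0.2782411
mean_chlorides = 0.0457724
mean_total_so2 = 138.3606574

total_at_means = degrading_total(mean_volatile_acidity, mean_chlorides, mean_total_so2)
volatile_share = mean_volatile_acidity / total_at_means
print(total_at_means, volatile_share)
```

Under these unit assumptions, volatile acidity accounts for roughly 60% of the total at the dataset means, so the combined metric leans heavily on that one component.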

Finally, alcohol level. The relationship between alcohol and residual sugar is generally inversely proportional, but the interesting part is that for low alcohol wines, the density distribution was tri-modal. So a low alcohol wine can have a sugar concentration anywhere on the spectrum, while medium and high alcohol wines lean towards being less sweet. Alcohol and density are related, but that was already expected, as alcohol has a lower density than water, so the more alcohol we have, the lower the total density is going to be. Most ingredients had a negative correlation with alcohol, so the trend for alcohol is that the higher it is, the lower the acidity, sweetness and saltiness of the wine.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

Yes: pH was most affected by the fixed acidity (more than volatile acidity). In fact, I was surprised that the pH had an almost 0 correlation coefficient with volatile acidity. Sulphates and alcohol had a small positive correlation with pH.

What was the strongest relationship you found?

Density:

  • Residual Sugar (coeff = .839)

  • Total Sulfur Dioxide (coeff = .53)

  • Alcohol (coeff = -.78)

Quality:

  • Alcohol (coeff = .436)

  • Density (coeff = -.307)

  • Total of Chlorides, Volatile Acidity and Total Sulfur Dioxide (coeff = -.268)

Alcohol:

  • Total Sulfur Dioxide (coeff = -.449)

  • Residual Sugar (coeff = -.451)

  • Chlorides (coeff = -.36)

(There are other, higher correlations, with quality and density, but those variables are affected by the alcohol level rather than affecting it.)

Multivariate Plots Section

NB: The coloring code that will be used through this section is the same as the previous section. Here is a reminder:

Wine Quality:

  • High Quality: Green

  • Medium Quality: Yellow

  • Low Quality: Red

Wine Alcohol Content:

  • High Alcohol: Green

  • Medium Alcohol: Yellow

  • Low Alcohol: Red

Quality

Quality Versus Alcohol Percentage and Total Sulfur Dioxide

Quality Versus Saltiness (Chlorides) and Sweetness (Residual Sugar)

Quality Versus the Two Types of Acidity

Alcohol

Alcohol Versus Saltiness (Chlorides) and Sweetness (Residual Sugar)

Alcohol Versus Quality and Total Sulfur Dioxide

This graph is confirming what is already known, but I like how it transfers the information, sort of spectrum bands, like light components analysis. But I think adding a jitter will make us see more details of the structure of each column.

I like this visualization, so I will repeat it, but this time for the total content of the degrading trio, that is volatile acidity (vinegar taste), chlorides (saltiness) and total sulfur dioxide (matches taste).

This last graph by itself may not be that interesting, but what I find eye-catching is the comparison with the one before: the dots for high alcohol wines moved higher, and those for medium and low alcohol bottles moved lower. I find this metric (the degrading trio) interesting, but I would still take it with a grain of salt. Numerically, the whole metric is highly dominated by the volatile acidity, and this is the part that makes me doubt it a bit. The pro for this metric is that it has a better correlation coefficient with quality than any of its constituents alone.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

For the alcohol level, the multivariate analysis just confirmed what was already inferred from the previous plots, with the higher quality scores (6 and onwards) containing noticeably more high alcohol wines.

Were there any interesting or surprising interactions between features?

Although for the most part you can visually distinguish that high quality and low quality wines occupied different regions, the medium quality wine overlapped with both regions, which cast some doubt over my selection criteria for quality. Of course there is also the numbers game effect: the amount of medium quality wine was huge compared to the other two categories, so maybe this is just the effect of the extreme imbalance in terms of quality within the sample. I thought of changing my ranking method to bad wine <= 5/10, medium wine = 6, high quality >= 7, so that I would get a more equal distribution among the three categories. But after some thinking, I would not call a 5/10 wine bad, and I would not call a 7 excellent. On the other hand, if I were working on a classification model, I would have picked that newer method; I think it would have led to a more accurate classification.
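The effect of the two ranking methods on class balance can be compared by counting how many bottles land in each bin. The Python sketch below uses a made-up list of scores purely to show the mechanics; the function names are hypothetical and the counts say nothing about the real dataset.

```python
from collections import Counter


def original_bins(score):
    """Ranking used in this report: 3-4 low, 5-7 medium, 8-9 high."""
    if score <= 4:
        return "Low"
    if score <= 7:
        return "Medium"
    return "High"


def alternative_bins(score):
    """Alternative ranking considered above: <= 5 low, 6 medium, >= 7 high."""
    if score <= 5:
        return "Low"
    if score == 6:
        return "Medium"
    return "High"


# Made-up scores, for illustration only.
scores = [3, 5, 5, 5, 6, 6, 6, 6, 7, 7, 8]
original_counts = Counter(original_bins(s) for s in scores)
alternative_counts = Counter(alternative_bins(s) for s in scores)
print(original_counts, alternative_counts)
```

On this toy list the original method puts 9 of 11 bottles in the medium bin, while the alternative spreads them 4, 4 and 3, which is the kind of balance a classifier would benefit from.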

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.

I have not yet, but I think I will work on that problem with my preferred tools for the task: Python and scikit-learn. I suspect that the best two factors for predicting quality are the alcohol percentage and the total of chlorides, volatile acidity and total sulfur dioxide (in grams/dm^3). Density might seem a good candidate, but since we have already included the top elements affecting density, and since density is also strongly affected by residual sugar, which shows almost no correlation with quality, I think it might actually be better to exclude it. I could also use citric acid, since it appears to have a very well defined window for wine quality.


Final Plots and Summary

Plot One

Description One

After splitting wines into three categories in terms of alcohol content, the distribution of each category over quality scores showed an interesting trend:

  • High and medium alcohol bottles had a peak at a score of 6, while low alcohol had its peak at a lower score, 5.

  • Low alcohol wine had a very prominent skewness, leaning towards the worst scores. Although medium and high alcohol had much more regular shapes, medium alcohol was slightly skewed towards the lower scores while high alcohol was slightly skewed towards the higher scores.

  • All scores had candidates from all categories, except for the best available score (9). That one had no low alcohol bottles at all, with the majority of the bottles being from the high alcohol category.

Plot Two

Description Two

Citric acid can add ‘freshness’ and flavor to wines. When visualising the density distribution of citric acid for the different quality categories, one gets the feeling that there is a narrow window for the “right” amount of citric acid (roughly between 0.2 and 0.4 g/dm^3). The higher the wine quality, the more bottles we have concentrated around this window.

Plot Three

Description Three

There are three elements that degrade wine quality when they are present in noticeable quantities. Acetic acid (volatile acidity) adds a vinegar-like taste, while chlorides make the wine salty. The last factor, sulfur dioxide, has a taste and smell like that of matches. When the three are summed in grams per cubic decimeter and spread over the different scores, we can notice something interesting regarding the highest scoring bottles: the higher the alcohol, the more these three degrading elements were tolerated. High scoring bottles with a medium amount of alcohol had a smaller amount of this degrading trio. Higher alcohol bottles are noticeably shifted upwards (observe score number 8), and generally the higher a point is in this graph, the more likely it is to be less favourably regarded.


Reflection

We can see some quality trends in this dataset, although they are not strong trends. The most noteworthy trends are:

  • Alcohol level

  • Volatile acidity

  • Chlorides

  • Total amount of sulfur dioxide

These parameters in general make sense, but the other thing that I think is worth discussing here is the quality parameter itself. There is a lot of subjectivity in the notion itself, and people have different tastes and perceive taste differently. Of course, the fact that the graders were wine experts reduces the subjectivity a lot, but I would still have preferred the quality score to be broken down into more constituents, or at least to come with an explanation of how it was calculated.

Also, I would have liked to examine the relationship between quality and the price, the geography and weather of the vineyard, and total sales. But these factors were omitted for privacy and logistical reasons.

My last reflection is not about this dataset but about the project in general. I started by picking a fairly complicated dataset, about astronomy. I struggled a lot with the analysis, and spent a lot of time trying to gain a good understanding of the patterns found. I learned a lot, but I felt that this was not going anywhere unless I could get good subject matter support. When I switched to the wine dataset, life was a lot easier. So my last reflection is about field expertise: the more of it we have, the less cumbersome the EDA process is.