White Wine Dataset Exploratory Data Analysis, by Emad Takla

{r global_options, include=FALSE} knitr::opts_chunk$set(echo=FALSE, warning=FALSE, message=FALSE)

Getting to Know the Data


The dataset has 4898 entries and 13 columns: the ID of the wine, 11 variables describing the wine, and the wine’s score as rated by at least 3 “wine experts”.


Summary of Each Parameter
variable                      mean            sd      median      min         max     n
X                     2449.5000000  1414.0751394  2449.50000  1.00000  4898.00000  4898
fixed.acidity            6.8547877     0.8438682     6.80000  3.80000    14.20000  4898
volatile.acidity         0.2782411     0.1007945     0.26000  0.08000     1.10000  4898
citric.acid              0.3341915     0.1210198     0.32000  0.00000     1.66000  4898
residual.sugar           6.3914149     5.0720578     5.20000  0.60000    65.80000  4898
chlorides                0.0457724     0.0218480     0.04300  0.00900     0.34600  4898
free.sulfur.dioxide     35.3080849    17.0071373    34.00000  2.00000   289.00000  4898
total.sulfur.dioxide   138.3606574    42.4980646   134.00000  9.00000   440.00000  4898
density                  0.9940274     0.0029909     0.99374  0.98711     1.03898  4898
pH                       3.1882666     0.1510006     3.18000  2.72000     3.82000  4898
sulphates                0.4898469     0.1141258     0.47000  0.22000     1.08000  4898
alcohol                 10.5142670     1.2306206    10.40000  8.00000    14.20000  4898
quality                  5.8779094     0.8856386     6.00000  3.00000     9.00000  4898
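
A table with these six statistics per column can be produced in a few lines of base R. The sketch below is one possible way to do it, not necessarily how the table above was generated. The helper name `summarise_columns` and the data frame `wine_toy` are hypothetical, and the toy values are made up for illustration.

```r
# Toy stand-in for the wine data frame; these values are made up.
wine_toy <- data.frame(
  alcohol = c(9.0, 9.5, 10.4, 11.2, 12.8),
  pH      = c(3.00, 3.10, 3.18, 3.25, 3.40)
)

# Hypothetical helper: one row per column, one column per statistic.
summarise_columns <- function(df) {
  stats <- sapply(df, function(x) {
    c(mean = mean(x), sd = sd(x), median = median(x),
      min = min(x), max = max(x), n = length(x))
  })
  t(stats)  # sapply puts statistics in rows, so transpose
}

summary_table <- summarise_columns(wine_toy)
print(summary_table)
```

Note that `n` here is simply the column length, so it only equals the number of observed values when the column has no missing entries.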

Univariate Plots Section

Fixed Acidity

Fixed acidity is measured by the concentration of tartaric acid in the wine. These acids do not evaporate readily. We have 0 missing values.

Bell shaped, a very small positive skewness.
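
The missing-value count reported for each variable can be obtained for all columns at once. This is a minimal sketch, assuming the data sits in a data frame; `wine_toy` is a made-up stand-in with one deliberately missing entry so that the output is not all zeros.

```r
# Toy stand-in for the wine data; the NA is inserted on purpose.
wine_toy <- data.frame(
  fixed.acidity = c(6.8, 7.0, NA, 6.3),
  citric.acid   = c(0.32, 0.30, 0.28, 0.36)
)

# is.na() gives a logical matrix; colSums() counts the TRUEs per column.
missing_per_column <- colSums(is.na(wine_toy))
print(missing_per_column)
```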

Volatile Acidity

Volatile acidity is measured by the concentration of acetic acid in the wine. At too high a level, it causes an unpleasant, vinegar-like taste. We have 0 missing values.

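Bell shaped, but with a more pronounced skewness here, a positive one: the mean (0.278) sits above the median (0.26), and the maximum (1.1) lies far out in the right tail.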
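
Skewness read by eye from a histogram can be cross-checked numerically. The sketch below is one possible check rather than part of the original analysis: it defines a hypothetical helper, `sample_skewness`, that computes the moment-based coefficient (third central moment divided by the second central moment raised to the power 3/2). A positive result indicates a longer right tail, a negative one a longer left tail. The input vectors are made up.

```r
# Hypothetical helper: moment-based sample skewness, m3 / m2^(3/2).
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  centered <- x - mean(x)
  m2 <- mean(centered^2)  # second central moment
  m3 <- mean(centered^3)  # third central moment
  m3 / m2^1.5
}

# Made-up vectors: one with a long right tail, its mirror image,
# and a symmetric one.
right_tailed <- c(1, 1, 2, 2, 3, 10)
left_tailed  <- -right_tailed
symmetric    <- c(1, 2, 3, 4, 5)

skew_right <- sample_skewness(right_tailed)
skew_left  <- sample_skewness(left_tailed)
skew_sym   <- sample_skewness(symmetric)
print(c(right = skew_right, left = skew_left, symmetric = skew_sym))
```

A quicker, rougher check is to compare the mean with the median in the summary table: a mean above the median usually goes with a positive skew.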

Citric Acid

Found in small quantities, citric acid can add freshness and flavor to wine. We have 0 missing values.

Also bell shaped, with a slight positive skew (a mean of 0.334 against a median of 0.32, and a maximum of 1.66).

Residual Sugar

Residual sugar is the amount of sugar remaining after fermentation stops. It is rare to find wines with less than 1 gram/liter, and wines with more than 45 grams/liter are considered sweet. We have 0 missing values.

A bimodal shape here, with an extremely high spike at the start (dry wine, with little or no residual sugar), then another, smaller peak at roughly 8 g/dm^3.


Chlorides

The amount of salt in the wine. We have 0 missing values.

There are a lot of outliers here, but the main bulk has a balanced bell shape.
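
To put a number on "a lot of outliers", one option is the usual boxplot convention, which flags points more than 1.5 times the interquartile range beyond the quartiles. Whether the original plots used this rule is an assumption. The helper `iqr_outlier_share` and the toy vector are hypothetical.

```r
# Hypothetical helper: share of values outside the 1.5 * IQR fences.
iqr_outlier_share <- function(x, k = 1.5) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  fence <- k * (q[2] - q[1])
  is_outlier <- x < q[1] - fence | x > q[2] + fence
  mean(is_outlier, na.rm = TRUE)  # proportion of flagged values
}

# Made-up values with a single extreme point.
chlorides_toy <- c(0.03, 0.04, 0.04, 0.05, 0.05, 0.06, 0.30)
outlier_share <- iqr_outlier_share(chlorides_toy)
print(outlier_share)
```

Applying the same helper to each column would make the comparisons between variables (for example chlorides against free sulfur dioxide) quantitative rather than visual.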

Free Sulfur Dioxide

The free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine. At free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine. We have 0 missing values.

Same as chlorides, but the proportion of outliers here is much smaller.

Total Sulfur Dioxide

Amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine. We have 0 missing values.

There are even fewer outliers here than for free sulfur dioxide.


Density

The density of wine is close to that of water, depending on the percent alcohol and sugar content. We have 0 missing values.

Very few outliers, and a numerically very narrow density range. The vast majority of the wines are just slightly less dense than water.


pH

Describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3 and 4 on the pH scale. We have 0 missing values.

A classical bell shape, and all wines are within the acidic pH range, with a mean pH around 3.2.


Sulphates

A wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant. We have 0 missing values.

Positively skewed bell shape.


Alcohol

The percent alcohol content of the wine. We have 0 missing values.

The alcohol distribution has a heavy positive skewness. The peak is at roughly 9%, but the bulk of the bottles have a higher alcohol content.

Quality Score

Expert rating for the wine, on a scale from 0 to 10. We have 0 missing values.

The minimum score is 3 and the maximum is 9. The most common score is 6, but more bottles score below 6 than above it.
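
Because quality takes only a handful of integer values, a frequency table is enough to find the most common score and compare the two sides of it. The sketch below uses a made-up vector, `quality_toy`, in place of the real column; the counts it prints say nothing about the actual dataset.

```r
# Made-up quality scores standing in for the real column.
quality_toy <- c(5, 6, 6, 7, 5, 6, 4, 5, 6, 8)

# Frequency of each score, and the score with the highest count.
score_counts <- table(quality_toy)
most_common  <- as.integer(names(score_counts)[which.max(score_counts)])

# How many bottles fall on each side of the most common score.
n_below <- sum(quality_toy < most_common)
n_above <- sum(quality_toy > most_common)

print(score_counts)
print(c(mode = most_common, below = n_below, above = n_above))
```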

Univariate Analysis

What is the structure of your dataset?

The dataset has 11 variables describing the wine chemically, and one last variable for the wine’s quality as perceived by wine experts, graded from 0 (very bad) to 10 (highest quality). Each wine was evaluated by at least 3 experts. As declared: “Due to privacy and logistic issues, only physicochemical (inputs) and sensory (the output) variables are available (e.g. there is no data about grape types, wine brand, wine selling price, etc.).”

I honestly would have wished to have the price tag as well.

What is/are the main feature(s) of interest in your dataset?

From the univariate analysis, no variable really stands out. All of them have a more or less bell-shaped distribution, with different levels of skewness. Intuitively though, the most important feature would be the quality.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

I will wait for the correlation calculation to see which variables are most related to quality. I would suspect that acidity, sulfur content and alcohol level will be important factors.
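
One way such a correlation calculation could look is sketched below: compute the Pearson correlation matrix with `cor()`, keep the `quality` column, and rank the remaining variables by absolute correlation. The data frame `wine_toy` is made up for illustration, so the printed numbers do not describe the real wines.

```r
# Made-up stand-in for the wine data; values are illustrative only.
wine_toy <- data.frame(
  alcohol = c(9.0, 9.5, 10.4, 11.2, 12.8, 10.0),
  density = c(0.998, 0.997, 0.995, 0.993, 0.990, 0.996),
  pH      = c(3.20, 3.05, 3.30, 3.10, 3.25, 3.15),
  quality = c(5, 5, 6, 6, 7, 6)
)

# Pearson correlation of every column with quality.
cor_with_quality <- cor(wine_toy)[, "quality"]

# Drop quality's correlation with itself, then rank by strength.
cor_with_quality <- cor_with_quality[names(cor_with_quality) != "quality"]
ranked <- cor_with_quality[order(abs(cor_with_quality), decreasing = TRUE)]
print(ranked)
```

Since quality is an ordinal score, passing `method = "spearman"` to `cor()` would be a reasonable alternative to the default Pearson coefficient.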

Did you create any new variables from existing variables in the dataset?

Not yet

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

Alcohol was positively skewed in a very noticeable way, while quality was slightly negatively skewed. There are not many outliers in the data.

Bivariate Plots Section

General Overview