Introduction

Diabetes is a critical global health priority, as its rising prevalence places an increasing burden on healthcare systems and individual quality of life.

In this analysis, we examine diabetes prevalence in the 20 countries with the highest rates and track how rates changed from 2011 to 2021. By comparing these data with national obesity rates, the project uses exploratory data analysis to highlight the relationship between obesity prevalence and the decade-long growth of the diabetes epidemic.

Data Preparation

# Load required packages
library(tidyverse)
library(ggplot2)

# Read dataset
diabetes <- read.csv("~/Desktop/R project/3_Diabetes/Diabetes prevalence.csv")
str(diabetes)
## 'data.frame':    256 obs. of  19 variables:
##  $ FREQ              : chr  "A" "A" "A" "A" ...
##  $ FREQ_LABEL        : chr  "Annual" "Annual" "Annual" "Annual" ...
##  $ REF_AREA          : chr  "ABW" "AFE" "AFG" "AFW" ...
##  $ REF_AREA_LABEL    : chr  "Aruba" "Africa Eastern and Southern" "Afghanistan" "Africa Western and Central" ...
##  $ INDICATOR         : chr  "WB_HNP_SH_STA_DIAB_ZS" "WB_HNP_SH_STA_DIAB_ZS" "WB_HNP_SH_STA_DIAB_ZS" "WB_HNP_SH_STA_DIAB_ZS" ...
##  $ INDICATOR_LABEL   : chr  "Diabetes prevalence (% of population ages 20 to 79)" "Diabetes prevalence (% of population ages 20 to 79)" "Diabetes prevalence (% of population ages 20 to 79)" "Diabetes prevalence (% of population ages 20 to 79)" ...
##  $ UNIT_MEASURE      : chr  "PT" "PT" "PT" "PT" ...
##  $ UNIT_MEASURE_LABEL: chr  "Percentage" "Percentage" "Percentage" "Percentage" ...
##  $ DATABASE_ID       : chr  "WB_HNP" "WB_HNP" "WB_HNP" "WB_HNP" ...
##  $ DATABASE_ID_LABEL : chr  "Health Nutrition and Population Statistics" "Health Nutrition and Population Statistics" "Health Nutrition and Population Statistics" "Health Nutrition and Population Statistics" ...
##  $ UNIT_MULT         : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ UNIT_MULT_LABEL   : chr  "Units" "Units" "Units" "Units" ...
##  $ OBS_STATUS        : chr  "A" "A" "A" "A" ...
##  $ OBS_STATUS_LABEL  : chr  "Normal value" "Normal value" "Normal value" "Normal value" ...
##  $ OBS_CONF          : chr  "PU" "PU" "PU" "PU" ...
##  $ OBS_CONF_LABEL    : chr  "Public" "Public" "Public" "Public" ...
##  $ X2000             : num  12.1 NA NA NA NA NA NA NA NA NA ...
##  $ X2011             : num  12.4 4.59 7.6 4.41 2.9 ...
##  $ X2021             : num  4.3 7.38 10.9 3.39 4.6 ...
obesity <- read.csv("~/Desktop/R project/3_Diabetes/Obesity prevalance.csv")
str(obesity)
## 'data.frame':    398 obs. of  34 variables:
##  $ IndicatorCode             : chr  "NCD_BMI_30C" "NCD_BMI_30C" "NCD_BMI_30C" "NCD_BMI_30C" ...
##  $ Indicator                 : chr  "Prevalence of obesity among adults, BMI &GreaterEqual; 30 (crude estimate) (%)" "Prevalence of obesity among adults, BMI &GreaterEqual; 30 (crude estimate) (%)" "Prevalence of obesity among adults, BMI &GreaterEqual; 30 (crude estimate) (%)" "Prevalence of obesity among adults, BMI &GreaterEqual; 30 (crude estimate) (%)" ...
##  $ ValueType                 : chr  "numeric" "numeric" "numeric" "numeric" ...
##  $ ParentLocationCode        : chr  "WPR" "SEAR" "AFR" "AFR" ...
##  $ ParentLocation            : chr  "Western Pacific" "South-East Asia" "Africa" "Africa" ...
##  $ Location.type             : chr  "Country" "Country" "Country" "Country" ...
##  $ SpatialDimValueCode       : chr  "VNM" "LKA" "AGO" "CIV" ...
##  $ Location                  : chr  "Viet Nam" "Sri Lanka" "Angola" "Cote d'Ivoire" ...
##  $ Period.type               : chr  "Year" "Year" "Year" "Year" ...
##  $ Period                    : int  2021 2021 2021 2021 2021 2021 2021 2021 2021 2021 ...
##  $ IsLatestYear              : chr  "false" "false" "false" "false" ...
##  $ Dim1.type                 : chr  "Sex" "Sex" "Sex" "Sex" ...
##  $ Dim1                      : chr  "Both sexes" "Both sexes" "Both sexes" "Both sexes" ...
##  $ Dim1ValueCode             : chr  "SEX_BTSX" "SEX_BTSX" "SEX_BTSX" "SEX_BTSX" ...
##  $ Dim2.type                 : chr  "Age Group" "Age Group" "Age Group" "Age Group" ...
##  $ Dim2                      : chr  "18+  years" "18+  years" "18+  years" "18+  years" ...
##  $ Dim2ValueCode             : chr  "AGEGROUP_YEARS18-PLUS" "AGEGROUP_YEARS18-PLUS" "AGEGROUP_YEARS18-PLUS" "AGEGROUP_YEARS18-PLUS" ...
##  $ Dim3.type                 : logi  NA NA NA NA NA NA ...
##  $ Dim3                      : logi  NA NA NA NA NA NA ...
##  $ Dim3ValueCode             : logi  NA NA NA NA NA NA ...
##  $ DataSourceDimValueCode    : logi  NA NA NA NA NA NA ...
##  $ DataSource                : logi  NA NA NA NA NA NA ...
##  $ FactValueNumericPrefix    : logi  NA NA NA NA NA NA ...
##  $ FactValueNumeric          : num  1.91 10.04 10.19 10.17 10.34 ...
##  $ FactValueUoM              : logi  NA NA NA NA NA NA ...
##  $ FactValueNumericLowPrefix : logi  NA NA NA NA NA NA ...
##  $ FactValueNumericLow       : num  1.48 8.62 6.76 8.18 5.57 5.46 8.89 9.36 8.9 9.33 ...
##  $ FactValueNumericHighPrefix: logi  NA NA NA NA NA NA ...
##  $ FactValueNumericHigh      : num  2.42 11.57 14.26 12.6 16.65 ...
##  $ Value                     : chr  "1.9 [1.5-2.4]" "10.0 [8.6-11.6]" "10.2 [6.8-14.3]" "10.2 [8.2-12.6]" ...
##  $ FactValueTranslationID    : logi  NA NA NA NA NA NA ...
##  $ FactComments              : logi  NA NA NA NA NA NA ...
##  $ Language                  : chr  "EN" "EN" "EN" "EN" ...
##  $ DateModified              : chr  "2024-02-28T14:00:00.000Z" "2024-02-28T14:00:00.000Z" "2024-02-28T14:00:00.000Z" "2024-02-28T14:00:00.000Z" ...

Data Inspection

head(diabetes)
##   FREQ FREQ_LABEL REF_AREA              REF_AREA_LABEL             INDICATOR
## 1    A     Annual      ABW                       Aruba WB_HNP_SH_STA_DIAB_ZS
## 2    A     Annual      AFE Africa Eastern and Southern WB_HNP_SH_STA_DIAB_ZS
## 3    A     Annual      AFG                 Afghanistan WB_HNP_SH_STA_DIAB_ZS
## 4    A     Annual      AFW  Africa Western and Central WB_HNP_SH_STA_DIAB_ZS
## 5    A     Annual      AGO                      Angola WB_HNP_SH_STA_DIAB_ZS
## 6    A     Annual      ALB                     Albania WB_HNP_SH_STA_DIAB_ZS
##                                       INDICATOR_LABEL UNIT_MEASURE
## 1 Diabetes prevalence (% of population ages 20 to 79)           PT
## 2 Diabetes prevalence (% of population ages 20 to 79)           PT
## 3 Diabetes prevalence (% of population ages 20 to 79)           PT
## 4 Diabetes prevalence (% of population ages 20 to 79)           PT
## 5 Diabetes prevalence (% of population ages 20 to 79)           PT
## 6 Diabetes prevalence (% of population ages 20 to 79)           PT
##   UNIT_MEASURE_LABEL DATABASE_ID                          DATABASE_ID_LABEL
## 1         Percentage      WB_HNP Health Nutrition and Population Statistics
## 2         Percentage      WB_HNP Health Nutrition and Population Statistics
## 3         Percentage      WB_HNP Health Nutrition and Population Statistics
## 4         Percentage      WB_HNP Health Nutrition and Population Statistics
## 5         Percentage      WB_HNP Health Nutrition and Population Statistics
## 6         Percentage      WB_HNP Health Nutrition and Population Statistics
##   UNIT_MULT UNIT_MULT_LABEL OBS_STATUS OBS_STATUS_LABEL OBS_CONF OBS_CONF_LABEL
## 1         0           Units          A     Normal value       PU         Public
## 2         0           Units          A     Normal value       PU         Public
## 3         0           Units          A     Normal value       PU         Public
## 4         0           Units          A     Normal value       PU         Public
## 5         0           Units          A     Normal value       PU         Public
## 6         0           Units          A     Normal value       PU         Public
##   X2000     X2011     X2021
## 1  12.1 12.400000  4.300000
## 2    NA  4.587181  7.381941
## 3    NA  7.600000 10.900000
## 4    NA  4.412739  3.389805
## 5    NA  2.900000  4.600000
## 6    NA  2.800000 10.200000
summary(diabetes)
##      FREQ            FREQ_LABEL          REF_AREA         REF_AREA_LABEL    
##  Length:256         Length:256         Length:256         Length:256        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##   INDICATOR         INDICATOR_LABEL    UNIT_MEASURE       UNIT_MEASURE_LABEL
##  Length:256         Length:256         Length:256         Length:256        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##  DATABASE_ID        DATABASE_ID_LABEL    UNIT_MULT UNIT_MULT_LABEL   
##  Length:256         Length:256         Min.   :0   Length:256        
##  Class :character   Class :character   1st Qu.:0   Class :character  
##  Mode  :character   Mode  :character   Median :0   Mode  :character  
##                                        Mean   :0                     
##                                        3rd Qu.:0                     
##                                        Max.   :0                     
##                                                                      
##   OBS_STATUS        OBS_STATUS_LABEL     OBS_CONF         OBS_CONF_LABEL    
##  Length:256         Length:256         Length:256         Length:256        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##      X2000           X2011            X2021       
##  Min.   : 0.00   Min.   : 1.900   Min.   : 1.100  
##  1st Qu.:11.88   1st Qu.: 5.200   1st Qu.: 5.800  
##  Median :12.10   Median : 7.601   Median : 7.900  
##  Mean   :11.47   Mean   : 8.013   Mean   : 9.068  
##  3rd Qu.:13.88   3rd Qu.: 9.500   3rd Qu.:10.983  
##  Max.   :15.50   Max.   :25.300   Max.   :30.800  
##  NA's   :238     NA's   :7
colSums(is.na(diabetes))
##               FREQ         FREQ_LABEL           REF_AREA     REF_AREA_LABEL 
##                  0                  0                  0                  0 
##          INDICATOR    INDICATOR_LABEL       UNIT_MEASURE UNIT_MEASURE_LABEL 
##                  0                  0                  0                  0 
##        DATABASE_ID  DATABASE_ID_LABEL          UNIT_MULT    UNIT_MULT_LABEL 
##                  0                  0                  0                  0 
##         OBS_STATUS   OBS_STATUS_LABEL           OBS_CONF     OBS_CONF_LABEL 
##                  0                  0                  0                  0 
##              X2000              X2011              X2021 
##                238                  7                  0
head(obesity)
##   IndicatorCode
## 1   NCD_BMI_30C
## 2   NCD_BMI_30C
## 3   NCD_BMI_30C
## 4   NCD_BMI_30C
## 5   NCD_BMI_30C
## 6   NCD_BMI_30C
##                                                                        Indicator
## 1 Prevalence of obesity among adults, BMI &GreaterEqual; 30 (crude estimate) (%)
## 2 Prevalence of obesity among adults, BMI &GreaterEqual; 30 (crude estimate) (%)
## 3 Prevalence of obesity among adults, BMI &GreaterEqual; 30 (crude estimate) (%)
## 4 Prevalence of obesity among adults, BMI &GreaterEqual; 30 (crude estimate) (%)
## 5 Prevalence of obesity among adults, BMI &GreaterEqual; 30 (crude estimate) (%)
## 6 Prevalence of obesity among adults, BMI &GreaterEqual; 30 (crude estimate) (%)
##   ValueType ParentLocationCode        ParentLocation Location.type
## 1   numeric                WPR       Western Pacific       Country
## 2   numeric               SEAR       South-East Asia       Country
## 3   numeric                AFR                Africa       Country
## 4   numeric                AFR                Africa       Country
## 5   numeric                EMR Eastern Mediterranean       Country
## 6   numeric               SEAR       South-East Asia       Country
##   SpatialDimValueCode                              Location Period.type Period
## 1                 VNM                              Viet Nam        Year   2021
## 2                 LKA                             Sri Lanka        Year   2021
## 3                 AGO                                Angola        Year   2021
## 4                 CIV                         Cote d'Ivoire        Year   2021
## 5                 DJI                              Djibouti        Year   2021
## 6                 PRK Democratic People's Republic of Korea        Year   2021
##   IsLatestYear Dim1.type       Dim1 Dim1ValueCode Dim2.type       Dim2
## 1        false       Sex Both sexes      SEX_BTSX Age Group 18+  years
## 2        false       Sex Both sexes      SEX_BTSX Age Group 18+  years
## 3        false       Sex Both sexes      SEX_BTSX Age Group 18+  years
## 4        false       Sex Both sexes      SEX_BTSX Age Group 18+  years
## 5        false       Sex Both sexes      SEX_BTSX Age Group 18+  years
## 6        false       Sex Both sexes      SEX_BTSX Age Group 18+  years
##           Dim2ValueCode Dim3.type Dim3 Dim3ValueCode DataSourceDimValueCode
## 1 AGEGROUP_YEARS18-PLUS        NA   NA            NA                     NA
## 2 AGEGROUP_YEARS18-PLUS        NA   NA            NA                     NA
## 3 AGEGROUP_YEARS18-PLUS        NA   NA            NA                     NA
## 4 AGEGROUP_YEARS18-PLUS        NA   NA            NA                     NA
## 5 AGEGROUP_YEARS18-PLUS        NA   NA            NA                     NA
## 6 AGEGROUP_YEARS18-PLUS        NA   NA            NA                     NA
##   DataSource FactValueNumericPrefix FactValueNumeric FactValueUoM
## 1         NA                     NA             1.91           NA
## 2         NA                     NA            10.04           NA
## 3         NA                     NA            10.19           NA
## 4         NA                     NA            10.17           NA
## 5         NA                     NA            10.34           NA
## 6         NA                     NA            10.38           NA
##   FactValueNumericLowPrefix FactValueNumericLow FactValueNumericHighPrefix
## 1                        NA                1.48                         NA
## 2                        NA                8.62                         NA
## 3                        NA                6.76                         NA
## 4                        NA                8.18                         NA
## 5                        NA                5.57                         NA
## 6                        NA                5.46                         NA
##   FactValueNumericHigh           Value FactValueTranslationID FactComments
## 1                 2.42   1.9 [1.5-2.4]                     NA           NA
## 2                11.57 10.0 [8.6-11.6]                     NA           NA
## 3                14.26 10.2 [6.8-14.3]                     NA           NA
## 4                12.60 10.2 [8.2-12.6]                     NA           NA
## 5                16.65 10.3 [5.6-16.7]                     NA           NA
## 6                16.98 10.4 [5.5-17.0]                     NA           NA
##   Language             DateModified
## 1       EN 2024-02-28T14:00:00.000Z
## 2       EN 2024-02-28T14:00:00.000Z
## 3       EN 2024-02-28T14:00:00.000Z
## 4       EN 2024-02-28T14:00:00.000Z
## 5       EN 2024-02-28T14:00:00.000Z
## 6       EN 2024-02-28T14:00:00.000Z
summary(obesity)
##  IndicatorCode       Indicator          ValueType         ParentLocationCode
##  Length:398         Length:398         Length:398         Length:398        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##  ParentLocation     Location.type      SpatialDimValueCode   Location        
##  Length:398         Length:398         Length:398          Length:398        
##  Class :character   Class :character   Class :character    Class :character  
##  Mode  :character   Mode  :character   Mode  :character    Mode  :character  
##                                                                              
##                                                                              
##                                                                              
##  Period.type            Period     IsLatestYear        Dim1.type        
##  Length:398         Min.   :2011   Length:398         Length:398        
##  Class :character   1st Qu.:2011   Class :character   Class :character  
##  Mode  :character   Median :2016   Mode  :character   Mode  :character  
##                     Mean   :2016                                        
##                     3rd Qu.:2021                                        
##                     Max.   :2021                                        
##      Dim1           Dim1ValueCode       Dim2.type             Dim2          
##  Length:398         Length:398         Length:398         Length:398        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##  Dim2ValueCode      Dim3.type        Dim3         Dim3ValueCode 
##  Length:398         Mode:logical   Mode:logical   Mode:logical  
##  Class :character   NA's:398       NA's:398       NA's:398      
##  Mode  :character                                               
##                                                                 
##                                                                 
##                                                                 
##  DataSourceDimValueCode DataSource     FactValueNumericPrefix FactValueNumeric
##  Mode:logical           Mode:logical   Mode:logical           Min.   : 0.74   
##  NA's:398               NA's:398       NA's:398               1st Qu.:10.23   
##                                                               Median :21.08   
##                                                               Mean   :21.80   
##                                                               3rd Qu.:28.45   
##                                                               Max.   :75.30   
##  FactValueUoM   FactValueNumericLowPrefix FactValueNumericLow
##  Mode:logical   Mode:logical              Min.   : 0.650     
##  NA's:398       NA's:398                  1st Qu.: 8.613     
##                                           Median :18.550     
##                                           Mean   :19.366     
##                                           3rd Qu.:25.085     
##                                           Max.   :68.860     
##  FactValueNumericHighPrefix FactValueNumericHigh    Value          
##  Mode:logical               Min.   : 0.84        Length:398        
##  NA's:398                   1st Qu.:12.31        Class :character  
##                             Median :23.84        Mode  :character  
##                             Mean   :24.38                          
##                             3rd Qu.:32.59                          
##                             Max.   :80.99                          
##  FactValueTranslationID FactComments     Language         DateModified      
##  Mode:logical           Mode:logical   Length:398         Length:398        
##  NA's:398               NA's:398       Class :character   Class :character  
##                                        Mode  :character   Mode  :character  
##                                                                             
##                                                                             
## 
colSums(is.na(obesity))
##              IndicatorCode                  Indicator 
##                          0                          0 
##                  ValueType         ParentLocationCode 
##                          0                          0 
##             ParentLocation              Location.type 
##                          0                          0 
##        SpatialDimValueCode                   Location 
##                          0                          0 
##                Period.type                     Period 
##                          0                          0 
##               IsLatestYear                  Dim1.type 
##                          0                          0 
##                       Dim1              Dim1ValueCode 
##                          0                          0 
##                  Dim2.type                       Dim2 
##                          0                          0 
##              Dim2ValueCode                  Dim3.type 
##                          0                        398 
##                       Dim3              Dim3ValueCode 
##                        398                        398 
##     DataSourceDimValueCode                 DataSource 
##                        398                        398 
##     FactValueNumericPrefix           FactValueNumeric 
##                        398                          0 
##               FactValueUoM  FactValueNumericLowPrefix 
##                        398                        398 
##        FactValueNumericLow FactValueNumericHighPrefix 
##                          0                        398 
##       FactValueNumericHigh                      Value 
##                          0                          0 
##     FactValueTranslationID               FactComments 
##                        398                        398 
##                   Language               DateModified 
##                          0                          0
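
The `colSums(is.na(obesity))` output shows several columns that are NA in all 398 rows (for example `Dim3` and `DataSource`). As a minimal sketch, such columns could be dropped in one step before further work. The toy data frame and the names `toy`, `all_na`, and `toy_clean` are illustrative assumptions, not part of the original analysis.

```r
# Toy data frame (made up): column b is NA in every row, column c only in one
toy <- data.frame(a = c(1, 2), b = c(NA, NA), c = c("x", NA))

# A column is "all NA" when its NA count equals the number of rows
all_na <- colSums(is.na(toy)) == nrow(toy)

# Keep only the columns that carry at least one value
toy_clean <- toy[, !all_na, drop = FALSE]
print(names(toy_clean))
```

The report itself sidesteps these columns by selecting only the four fields it needs in the cleaning step.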

Data Cleaning and Reshaping

diabetes_1 <- diabetes %>%
  select(REF_AREA, REF_AREA_LABEL, X2011, X2021) %>%
  mutate(diabetes_change = X2021 - X2011) %>%
  rename(Location = REF_AREA_LABEL,
         diabetes_2021 = X2021,
         diabetes_2011 = X2011,
         code = REF_AREA)

obesity_1 <- obesity %>%
  select(SpatialDimValueCode, Location, Period, Value) %>%
  pivot_wider(
    names_from = Period,
    values_from = Value
    ) %>%
  rename(obesity_2021 = "2021",
         obesity_2011 = "2011",
         code = SpatialDimValueCode) %>%
  mutate(
    obesity_2011 = as.numeric(str_extract(obesity_2011, "^[0-9.]+")),
    obesity_2021 = as.numeric(str_extract(obesity_2021, "^[0-9.]+")),
    obesity_change = obesity_2021 - obesity_2011
  )
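
To make the parsing step concrete, here is a small sketch of how the leading point estimate is pulled out of `Value` strings such as "10.0 [8.6-11.6]". The three sample strings are copied from the `str(obesity)` output; the names `value_strings` and `point_estimates` are illustrative only.

```r
library(stringr)

# Sample strings copied from the Value column shown in str(obesity)
value_strings <- c("1.9 [1.5-2.4]", "10.0 [8.6-11.6]", "10.2 [6.8-14.3]")

# "^[0-9.]+" matches the run of digits and dots at the start of the string,
# i.e. the point estimate that precedes the bracketed interval
point_estimates <- as.numeric(str_extract(value_strings, "^[0-9.]+"))
print(point_estimates)
```

The dataset also carries an unrounded estimate in `FactValueNumeric` (1.91, 10.04, 10.19 for the same rows), so selecting that column would avoid the rounding built into `Value`. That is an alternative, not what the cleaning code here does.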

Exploratory Data Analysis

Global Diabetes Prevalence Average

diabetes_summary_avg <- diabetes_1 %>%
  summarise(
    dm_mean_2011 = mean(diabetes_2011, na.rm = TRUE),
    dm_mean_2021 = mean(diabetes_2021, na.rm = TRUE),
    dm_mean_change = mean(diabetes_change, na.rm = TRUE)
  )
print(diabetes_summary_avg)
##   dm_mean_2011 dm_mean_2021 dm_mean_change
## 1     8.012563     9.067681       1.007623

The unweighted mean prevalence across all reporting areas (countries and regional aggregates alike) rose from 8.01% in 2011 to 9.07% in 2021.
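
Note that `dm_mean_change` (about 1.01) is not the difference between the two printed means (9.07 - 8.01, about 1.06). A likely reason is the missing data: `X2011` has 7 NAs while `X2021` has none, so `diabetes_change` is NA for those rows and its mean is taken over fewer areas. The sketch below reproduces the effect on made-up numbers; `toy` and its columns are hypothetical.

```r
# Toy data (made up for illustration): one area has no 2011 value
toy <- data.frame(
  y2011 = c(5, 7, NA),
  y2021 = c(6, 9, 12)
)
toy$change <- toy$y2021 - toy$y2011   # NA wherever 2011 is missing

# Difference of means: 2021 uses 3 rows, 2011 uses only 2
diff_of_means <- mean(toy$y2021, na.rm = TRUE) - mean(toy$y2011, na.rm = TRUE)

# Mean of change: uses only the 2 rows that have both years
mean_of_change <- mean(toy$change, na.rm = TRUE)

print(diff_of_means)
print(mean_of_change)
```

Restricting both means to rows with complete data would make the two figures agree.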

Diabetes Prevalence Distribution

diabetes_long <- diabetes_1 %>%
  select(Location, diabetes_2011, diabetes_2021) %>%
  pivot_longer(cols = starts_with("diabetes"), 
               names_to = "Year", 
               values_to = "Prevalence") %>%
  mutate(Year = ifelse(Year == "diabetes_2011", "2011", "2021")) %>%
  filter(!is.na(Prevalence))

ggplot(diabetes_long, aes(Year, Prevalence)) +
  geom_boxplot(fill = "skyblue") +
  labs(title = "Diabetes Prevalence (2011 vs. 2021)", y = "Prevalence (%)", x = NULL) +
  theme_minimal()

The boxplot shows a clear upward shift in diabetes prevalence from 2011 to 2021, with both the median and the overall distribution reaching higher levels. The increased number of outliers in 2021 indicates that more countries face extreme prevalence rates, with the maximum rising from 25.3% to 30.8%.
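
The outlier comparison can be checked numerically rather than read off the plot. The sketch below counts points lying more than 1.5 × IQR beyond the quartiles, which is the default rule `geom_boxplot()` uses to draw outliers. The helper name `count_outliers` and the toy vector are assumptions made for illustration.

```r
# Count values more than 1.5 * IQR below Q1 or above Q3
count_outliers <- function(x) {
  x <- x[!is.na(x)]                      # ignore missing values
  q <- quantile(x, c(0.25, 0.75))        # first and third quartiles
  fence <- 1.5 * (q[2] - q[1])           # 1.5 times the interquartile range
  sum(x < q[1] - fence | x > q[2] + fence)
}

# Toy values (made up): one clearly extreme observation
toy_prevalence <- c(4, 5, 6, 7, 8, 9, 10, 30)
print(count_outliers(toy_prevalence))
```

Applied to `diabetes_1$diabetes_2011` and `diabetes_1$diabetes_2021`, the same helper would give the two outlier counts being compared.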

Diabetes Prevalence Top 20 Countries (2021)

top20_DM_2021 <- diabetes_1 %>%
  arrange(desc(diabetes_2021)) %>%
  head(20)

ggplot(top20_DM_2021, aes(x = reorder(Location, diabetes_2021), y = diabetes_2021)) +
  geom_col(fill = "skyblue") +
  geom_text(aes(label = round(diabetes_2021,1)), 
            hjust = -0.1) +
  labs(title = "Top 20 Countries by Diabetes Prevalence in 2021",
       subtitle = "Prevalence among adults aged 20–79",
       x = NULL, y = "Prevalence (%)") +
  coord_flip() +
  theme_minimal()

Pakistan leads with a prevalence of 30.8%, well above the other nations on the list. The top 20 is heavily concentrated in Pacific Island nations and the Middle East.
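
The World Bank file mixes countries with regional aggregates (for example "Africa Eastern and Southern" in the `head(diabetes)` output), so a ranking built with `head(20)` could in principle pick up aggregate rows. One hedged way to guard against that is to keep only codes that also appear in the obesity data, on the assumption that it lists countries only (its `Location.type` column reads "Country" in the rows shown). The toy tibbles below are stand-ins for `diabetes_1` and `obesity_1`, and the choice of codes in `toy_obesity` is illustrative.

```r
library(dplyr)

# Toy stand-in for diabetes_1: codes and labels as in head(diabetes),
# 2021 values rounded from the same output
toy_diabetes <- tibble(
  code = c("AFE", "AFG", "AGO"),
  Location = c("Africa Eastern and Southern", "Afghanistan", "Angola"),
  diabetes_2021 = c(7.4, 10.9, 4.6)
)

# Toy stand-in for obesity_1: assumed to contain country codes only
toy_obesity <- tibble(code = c("AFG", "AGO"))

# semi_join keeps rows of toy_diabetes whose code has a match in toy_obesity
countries_only <- toy_diabetes %>%
  semi_join(toy_obesity, by = "code") %>%
  arrange(desc(diabetes_2021))
print(countries_only)
```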

Diabetes Prevalence Change Overview

diabetes_1 %>%
  filter(!is.na(diabetes_change)) %>%
  ggplot(aes(y = diabetes_change)) +
  geom_boxplot(fill = "skyblue") +
  labs(title = "Distribution of Diabetes Prevalence Change (2011-2021)", 
       y = "Change in Prevalence (percentage points)", x = NULL) +
  theme_minimal() +
  theme(axis.text.x = element_blank())

The prevalence change boxplot shows that most countries experienced a modest increase, while several extreme outliers saw spikes of over 20 percentage points.

Diabetes Prevalence Change Top 20 Countries

top20_DMchange <- diabetes_1 %>%
  arrange(desc(abs(diabetes_change))) %>%
  head(20)

ggplot(top20_DMchange, aes(x = reorder(Location, abs(diabetes_change)), y = diabetes_change)) +
  geom_col(fill = "skyblue") +
  geom_text(aes(label = round(diabetes_change, 1)), hjust = ifelse(top20_DMchange$diabetes_change > 0, -0.1, 1.1)) +
  coord_flip() +
  labs(
    title = "Top 20 Countries by Diabetes Prevalence Change (2011–2021)",
    subtitle = "Prevalence among adults aged 20–79",
    y = "Prevalence Change (percentage points)",
    x = NULL
  ) +
  theme_minimal()

While Pakistan shows a staggering increase of 22.9 percentage points, the chart also reveals sizable decreases in countries such as Lebanon and Bahrain.
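
Because `diabetes_change` is a difference between two percentages, it is measured in percentage points, which is not the same as a percent (relative) change. The short calculation below uses the two Pakistan figures quoted in the text (30.8 in 2021 and a change of 22.9) to show the distinction; the variable names are illustrative.

```r
# Figures quoted in the text for Pakistan
prevalence_2021 <- 30.8   # % of adults aged 20-79 in 2021
change_pp       <- 22.9   # change from 2011 to 2021, in percentage points

# Implied 2011 level, and the change expressed relative to that level
prevalence_2011 <- prevalence_2021 - change_pp
relative_change <- change_pp / prevalence_2011 * 100

print(round(prevalence_2011, 1))   # 2011 prevalence in %
print(round(relative_change))      # percent change relative to 2011
```

On these figures the implied 2011 prevalence is 7.9%, so the rise of 22.9 percentage points corresponds to a relative increase of roughly 290%.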

Global Obesity Prevalence Average

obesity_summary_avg <- obesity_1 %>%
  summarise(
    obesity_mean_2011 = mean(obesity_2011, na.rm = TRUE),
    obesity_mean_2021 = mean(obesity_2021, na.rm = TRUE),
    obesity_mean_change = mean(obesity_change, na.rm = TRUE)
  )
print(obesity_summary_avg)
## # A tibble: 1 × 3
##   obesity_mean_2011 obesity_mean_2021 obesity_mean_change
##               <dbl>             <dbl>               <dbl>
## 1              19.4              24.2                4.72

Average obesity prevalence rose by about 4.7 percentage points over ten years, from 19.4% to 24.2%.
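
Note that the table reports the mean of the per-country changes (4.72), which is not computed by subtracting the two rounded means. For the obesity columns no values are missing, so any gap between the two is only rounding. When a column does have missing values, as the diabetes columns do later in this analysis, the two quantities can genuinely diverge. A minimal sketch with hypothetical toy values:

```r
# Hypothetical toy data: one country lacks a 2011 value
toy <- data.frame(
  obesity_2011 = c(10, 20, NA, 30),
  obesity_2021 = c(14, 26, 40, 35)
)
toy$obesity_change <- toy$obesity_2021 - toy$obesity_2011

# Each mean silently uses a different set of countries when na.rm = TRUE
mean_2011   <- mean(toy$obesity_2011,   na.rm = TRUE)  # 3 countries
mean_2021   <- mean(toy$obesity_2021,   na.rm = TRUE)  # 4 countries
mean_change <- mean(toy$obesity_change, na.rm = TRUE)  # 3 countries

diff_of_means <- mean_2021 - mean_2011

cat("Mean of changes:    ", mean_change, "\n")
cat("Difference of means:", diff_of_means, "\n")
```

Restricting all three calculations to countries with complete data is one way to keep the figures comparable.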

Obesity Prevalence Distribution

obesity_long <- obesity_1 %>%
  pivot_longer(
    cols = c(obesity_2011, obesity_2021),
    names_to = "Year",
    values_to = "Prevalence"
  ) %>%
  mutate(Year = str_replace(Year, "obesity_", "")) %>%
  filter(!is.na(Prevalence))

ggplot(obesity_long, aes(Year, Prevalence)) +
  geom_boxplot(fill = "skyblue") +
  labs(title = "Obesity Prevalence (2011 vs. 2021)", y = "Prevalence (%)", x = NULL) +
  theme_minimal()

The boxplot shows a clear upward shift in global obesity rates from 2011 to 2021. Both the median and overall distribution have moved higher.

Obesity Prevalence Change Overview

obesity_1 %>%
  filter(!is.na(obesity_change)) %>%
  ggplot(aes(y = obesity_change)) +
  geom_boxplot(fill = "skyblue") +
  labs(title = "Distribution of Prevalence Change (2011-2021)", 
       y = "Change in Prevalence (percentage points)", x = NULL) +
  theme_minimal() +
  theme(axis.text.x = element_blank())

The median increase in obesity is around 4.7 percentage points, and the distribution points to a broad upward trend. Most countries experienced growth of 3 to 7 percentage points over the decade.
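
Values read off a boxplot are approximate, so it can help to compute the median, the quartiles, and the share of countries with an increase directly. The sketch below uses hypothetical toy values; the summary column names are illustrative:

```r
library(dplyr)

# Hypothetical toy data; the column name mirrors obesity_1 above
toy <- data.frame(obesity_change = c(2, 3, 4, 5, 6, 7, NA))

change_summary <- toy %>%
  summarise(
    median_change    = median(obesity_change, na.rm = TRUE),
    q1               = quantile(obesity_change, 0.25, na.rm = TRUE, names = FALSE),
    q3               = quantile(obesity_change, 0.75, na.rm = TRUE, names = FALSE),
    # Proportion of non-missing countries whose prevalence went up
    share_increasing = mean(obesity_change > 0, na.rm = TRUE)
  )

print(change_summary)
```

A share_increasing value below 1 on the real data would show that the upward trend, while broad, is not universal.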

Obesity Prevalence Change Top20 countries

top20_Obesitychange <- obesity_1 %>%
  arrange(desc(abs(obesity_change))) %>%
  head(20)

ggplot(top20_Obesitychange, aes(x = reorder(Location, abs(obesity_change)), y = obesity_change)) +
  geom_col(fill = "skyblue") +
  geom_text(aes(label = obesity_change), hjust = ifelse(top20_Obesitychange$obesity_change > 0, -0.1, 1.1)) +
  coord_flip() +
  labs(
    title = "Top 20 Countries by Obesity Prevalence Change (2011–2021)",
    subtitle = "Prevalence among adults (18 years and older)",
    y = "Prevalence Change (percentage points)",
    x = NULL
  ) +
  theme_minimal()

Romania recorded the largest increase in obesity prevalence, rising by over 13 percentage points. Several other nations, including Uzbekistan and Pakistan, also saw large increases exceeding 9 percentage points.

Compare Diabetes and Obesity Prevalence

Merge datasets

data <- obesity_1 %>%
  left_join(diabetes_1, by = "code")

data <- data %>%
  select(
    code,
    Country = Location.x,
    obesity_2011, obesity_2021, obesity_change,
    diabetes_2011, diabetes_2021, diabetes_change
  )

nrow(obesity_1)
## [1] 199
nrow(data)
## [1] 199
colSums(is.na(data))
##            code         Country    obesity_2011    obesity_2021  obesity_change 
##               0               0               0               0               0 
##   diabetes_2011   diabetes_2021 diabetes_change 
##               6               3               6
data %>% 
  summarise(
    avg_obesity_2021 = mean(obesity_2021, na.rm = TRUE),
    avg_diabetes_2021 = mean(diabetes_2021, na.rm = TRUE)
  )
## # A tibble: 1 × 2
##   avg_obesity_2021 avg_diabetes_2021
##              <dbl>             <dbl>
## 1             24.2              8.89
missing_list <- data %>%
  filter(is.na(diabetes_change)) %>%
  select(code, Country)
print(missing_list)
## # A tibble: 6 × 2
##   code  Country       
##   <chr> <chr>         
## 1 GRL   Greenland     
## 2 NIU   Niue          
## 3 COK   Cook Islands  
## 4 TKL   Tokelau       
## 5 SSD   South Sudan   
## 6 ASM   American Samoa
data_clean <- data %>%
  drop_na(obesity_change, diabetes_change)
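
Because a left join keeps every row of obesity_1, an unchanged row count does not by itself show which codes failed to match. One way to check both directions is anti_join(). The sketch below uses hypothetical toy tables (obesity_toy and diabetes_toy are made up for illustration):

```r
library(dplyr)

# Hypothetical toy tables sharing a "code" key
obesity_toy <- data.frame(
  code = c("AAA", "BBB", "CCC"),
  obesity_2021 = c(20, 30, 40)
)
diabetes_toy <- data.frame(
  code = c("AAA", "CCC", "ZZZ"),
  diabetes_2021 = c(8, 10, 12)
)

# Rows in the obesity table with no diabetes match (these become NA after a left join)
unmatched_obesity <- anti_join(obesity_toy, diabetes_toy, by = "code")

# Rows in the diabetes table that the left join silently drops
unmatched_diabetes <- anti_join(diabetes_toy, obesity_toy, by = "code")

print(unmatched_obesity)
print(unmatched_diabetes)
```

Running the same two calls on obesity_1 and diabetes_1 would show whether any diabetes rows, such as regional aggregates, were left out of the merged data.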

Correlation and Regression Analysis: Obesity vs. Diabetes (2021)

cor.test(data_clean$diabetes_2021, data_clean$obesity_2021)
## 
##  Pearson's product-moment correlation
## 
## data:  data_clean$diabetes_2021 and data_clean$obesity_2021
## t = 8.3553, df = 191, p-value = 1.32e-14
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4057778 0.6137577
## sample estimates:
##       cor 
## 0.5173666
model_2021 <- lm(diabetes_2021 ~ obesity_2021, data = data_clean)
summary(model_2021)
## 
## Call:
## lm(formula = diabetes_2021 ~ obesity_2021, data = data_clean)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.3721 -3.1511 -0.9643  2.1914 22.4557 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   4.07565    0.65743   6.199 3.42e-09 ***
## obesity_2021  0.20621    0.02468   8.355 1.32e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.436 on 191 degrees of freedom
## Multiple R-squared:  0.2677, Adjusted R-squared:  0.2638 
## F-statistic: 69.81 on 1 and 191 DF,  p-value: 1.32e-14
ggplot(data_clean, aes(x = obesity_2021, y = diabetes_2021)) +
  geom_point() +
  labs(title = "Relationship between Obesity and Diabetes Prevalence (2021)",
       x = "Obesity prevalence (%)",
       y = "Diabetes prevalence (%)") +
  geom_smooth(method = "lm")
## `geom_smooth()` using formula = 'y ~ x'

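The upward-sloping regression line reflects a moderate positive correlation between obesity and diabetes prevalence in 2021 (r = 0.52, p < 0.001). Each additional percentage point of obesity prevalence is associated with roughly 0.21 percentage points more diabetes prevalence. However, obesity explains only about 27% of the variation in diabetes prevalence (R-squared = 0.27), and large outliers indicate that other factors also influence diabetes rates.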
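
The correlation test and the regression output above describe the same relationship. In a simple linear regression with one predictor, R-squared equals the squared Pearson correlation, and the slope's t statistic can be recovered from r and the degrees of freedom. A short check using the values printed above:

```r
r  <- 0.5173666  # Pearson correlation reported by cor.test()
df <- 191        # degrees of freedom, n - 2 for n = 193 countries

# R-squared of the one-predictor model is the squared correlation
r_squared <- r^2

# t statistic for testing a zero correlation (equivalently, a zero slope)
t_stat <- r * sqrt(df) / sqrt(1 - r^2)

cat("R-squared:  ", round(r_squared, 4), "\n")
cat("t statistic:", round(t_stat, 4), "\n")
```

Both values agree with the lm() summary, which is why the two outputs report an identical p-value.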

Finding Exceptions

data_clean <- data_clean %>%
  mutate(res_2021 = residuals(model_2021))

Identifying Positive Outliers: Higher Than Predicted

top_positive_res <- data_clean %>%
  arrange(desc(res_2021)) %>%
  head(5) %>%
  select(Country, obesity_2021, diabetes_2021, res_2021)
print(top_positive_res)
## # A tibble: 5 × 4
##   Country         obesity_2021 diabetes_2021 res_2021
##   <chr>                  <dbl>         <dbl>    <dbl>
## 1 Pakistan                20.7          30.8     22.5
## 2 Mauritius               19.1          22.6     14.6
## 3 Sudan                   14.8          18.9     11.8
## 4 Kuwait                  45            24.9     11.5
## 5 Solomon Islands         21.2          19.8     11.4

These nations show significantly higher diabetes rates than predicted by their obesity levels. Pakistan shows the highest positive residual, with diabetes rates soaring far beyond what its obesity levels would suggest.
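
To make the residuals concrete, they can be reproduced by hand from the fitted coefficients: the predicted value is the intercept plus the slope times obesity prevalence, and the residual is the observed value minus that prediction. The sketch below uses the rounded figures printed above, so results may differ slightly from the model's own residuals:

```r
# Coefficients from summary(model_2021)
intercept <- 4.07565
slope     <- 0.20621

# Observed values as printed in the table above (rounded)
countries <- data.frame(
  Country       = c("Pakistan", "Kuwait"),
  obesity_2021  = c(20.7, 45),
  diabetes_2021 = c(30.8, 24.9)
)

# Predicted diabetes prevalence from obesity alone
countries$predicted <- intercept + slope * countries$obesity_2021

# Residual = observed - predicted
countries$residual <- countries$diabetes_2021 - countries$predicted

print(countries)
```

For Pakistan, the model predicts a diabetes prevalence of roughly 8.3% from its obesity level, against an observed 30.8%.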

Identifying Negative Outliers: Lower Than Predicted

top_negative_res <- data_clean %>%
  arrange(res_2021) %>%
  head(5) %>%
  select(Country, obesity_2021, diabetes_2021, res_2021)
print(top_negative_res)
## # A tibble: 5 × 4
##   Country    obesity_2021 diabetes_2021 res_2021
##   <chr>             <dbl>         <dbl>    <dbl>
## 1 Samoa              60.6           9.2    -7.37
## 2 Ireland            30.4           3      -7.34
## 3 Croatia            34.7           4.8    -6.43
## 4 Georgia            38             5.7    -6.21
## 5 Mauritania         20.2           2.1    -6.14

These nations show significantly lower diabetes rates than predicted by their obesity levels. Samoa and Ireland exhibit the largest negative residuals.
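
Ranking raw residuals shows which countries deviate most, but not whether a deviation is unusually large relative to the model's overall scatter. One option is to use standardized residuals from rstandard(). The sketch below uses hypothetical toy data, and the cutoff of 2 is a common rule of thumb assumed here, not a threshold taken from this analysis:

```r
# Hypothetical data: a linear trend with small alternating noise
x <- 1:20
y <- 2 + 0.5 * x + rep(c(0.3, -0.3), times = 10)
y[10] <- y[10] + 15  # plant one unusually high observation

fit <- lm(y ~ x)

# rstandard() rescales each residual by its estimated standard deviation
std_res <- rstandard(fit)

# Flag observations more than 2 standardized units from the fitted line
flagged <- which(abs(std_res) > 2)
print(flagged)
```

Applied to model_2021, the same approach would put Pakistan's residual on a scale that is comparable across models.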

Conclusion

Our analysis confirms that the rising prevalence of diabetes and obesity is indeed a significant global burden. However, the intensity of this burden varies considerably across the countries studied.

By tracking data from 2011 to 2021, we found that while national obesity rates are moderately correlated with current diabetes prevalence (r = 0.52), they explain only about a quarter of the variation between countries and do not fully account for the decade-long growth observed in every nation. This suggests that the diabetes epidemic is driven by a complex interplay of factors beyond obesity, such as genetics, dietary habits, and socioeconomic conditions.