December 14, 2022

Prepare Library

library(ggplot2)

library(tidyverse)

Data Types, Why Care?

  • data types in daily life are simple
    • numbers
    • words
    • logical statements
  • data types in R are similar
    • categorises data in different classes
    • different in terms of syntax

Data Types in R

  1. Numeric (1, 2.1, 40, 600.8)
  2. Integer (1, 2, 3, 4, 5)
  3. Logical (TRUE / FALSE)
  4. Character (“Christmas”, “b”)
  5. Factor (categorical variables)
  6. Complex (9 + 1i)
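As a quick check of the list above, class() can be applied to one example value per type. The values are made up for illustration, and sapply() is only used here to call class() on each of them in one go.

```r
# One made-up example value per data type
examples <- list(2.1, 3L, TRUE, "Christmas", factor(c("a", "b", "a")), 9 + 1i)

# Apply class() to every element of the list
types <- sapply(examples, class)
types
## [1] "numeric"   "integer"   "logical"   "character" "factor"    "complex"
```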

Numeric Data

  • numbers, written as integers or decimals
#make vector
n <- c(2, 5.3, 9.7, 15.2)

Is Numeric?

Method 1

is.numeric(n)
## [1] TRUE

Method 2

class(n)
## [1] "numeric"

Mathematical operations

n
## [1]  2.0  5.3  9.7 15.2
n + 9
## [1] 11.0 14.3 18.7 24.2
n - 1.5
## [1]  0.5  3.8  8.2 13.7
n * 8
## [1]  16.0  42.4  77.6 121.6

Integer

  • numbers, without decimals
  • R stores numbers as numeric by default, even when they have no decimals
i <- c(9, 11, 7, 5, 18)
class(i)
## [1] "numeric"

But you can change that:

i <- as.integer(i)
class(i)
## [1] "integer"

Create Integer Vector Directly

i_direct <- c(9L, 11L, 7L, 5L, 18L)
i_direct
## [1]  9 11  7  5 18
class(i_direct)
## [1] "integer"
  • Some functions automatically generate integer vectors
    • sample()
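A minimal sketch of that behaviour, with arbitrary values: sampling from an integer sequence returns integers, and so does the colon operator. set.seed() is only there to make the draw repeatable.

```r
# Draw three values from 1 to 10
set.seed(1)
s <- sample(1:10, 3)
class(s)
## [1] "integer"

# The colon operator also creates an integer vector
class(1:5)
## [1] "integer"
```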

Character

  • store text in R
    • strings
    • written inside quotation marks (" ")
ch <- c("Make", "character", "vector")
ch
## [1] "Make"      "character" "vector"   
class(ch)
## [1] "character"

Number Characters

  • Quotation marks also turn numbers into characters
c_n <- c("5", "9", "11", "13")
class(c_n)
## [1] "character"
  • No math possible!
mean(c_n)
## Warning in mean.default(c_n): argument is not numeric or logical: returning NA
## [1] NA

Common Number Character Issue

  • you get a warning when a column of numbers accidentally contains a character value
    • for example, a space before a number
  • circumvent with as.numeric()
    • values that cannot be converted become NA
  • better: fix the value in the data set itself!
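The conversion can be sketched as follows, assuming a hypothetical column in which one entry was typed as a word. as.numeric() converts what it can, replaces the rest with NA, and warns that NAs were introduced by coercion.

```r
# Hypothetical column in which one entry was typed as text
raw <- c("5", "9", "eleven", "13")

# Convert to numeric; the entry that is not a number becomes NA (with a warning)
converted <- as.numeric(raw)
converted
## [1]  5  9 NA 13

# Locate the entries that need fixing in the data set itself
which(is.na(converted))
## [1] 3
```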

Logical

  • TRUE or FALSE
  • T or F
  • lowercase true, false, t, and f do not work!

Logical Example

numbers <- c(13, 8, 21, 37, 5)

#Ask R if elements in vector are greater than 9
nine <- numbers > 9
nine
## [1]  TRUE FALSE  TRUE  TRUE FALSE
#check class
class(nine)
## [1] "logical"

Create Logical Vector

l <- c(T, T, F, T, F)
l
## [1]  TRUE  TRUE FALSE  TRUE FALSE
#make numeric
l <- as.numeric(l)
l
## [1] 1 1 0 1 0

Do Math

l <- as.logical(l)
l
## [1]  TRUE  TRUE FALSE  TRUE FALSE
sum(l)
## [1] 3
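Because TRUE counts as 1 and FALSE as 0, a comparison can be summed or averaged directly. A small sketch, reusing the numbers vector from the logical example:

```r
numbers <- c(13, 8, 21, 37, 5)

# How many elements are greater than 9?
n_above <- sum(numbers > 9)
n_above
## [1] 3

# Which proportion of the elements is greater than 9?
share_above <- mean(numbers > 9)
share_above
## [1] 0.6
```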

Factor

  • used for repeating categories
    • categorical variables

Example Factor

data <- data.frame(ID = c("Jane", "Matt", "Dan", "Karen", "Harold"),
                   age = c(35, 25, 30, 27, 29),
                   sex = c("female", "male", "male", "female", "male"))

## view data structure
str(data)
## 'data.frame':    5 obs. of  3 variables:
##  $ ID : chr  "Jane" "Matt" "Dan" "Karen" ...
##  $ age: num  35 25 30 27 29
##  $ sex: chr  "female" "male" "male" "female" ...

Change Data Type

  • column sex is character vector
    • we want a categorical variable with the levels female/male
data$sex <- as.factor(data$sex)
data$sex
## [1] female male   male   female male  
## Levels: female male
# rest of data is unchanged
str(data)
## 'data.frame':    5 obs. of  3 variables:
##  $ ID : chr  "Jane" "Matt" "Dan" "Karen" ...
##  $ age: num  35 25 30 27 29
##  $ sex: Factor w/ 2 levels "female","male": 1 2 2 1 2

Factor Levels

  • R lists levels in alphabetical order
finish <- factor(c("first", "second", "fourth", "second", "first", "third", "fifth"))
finish
## [1] first  second fourth second first  third  fifth 
## Levels: fifth first fourth second third
  • change level order
finish <- factor(finish, levels = c("first", "second", "third", "fourth", "fifth"))
finish
## [1] first  second fourth second first  third  fifth 
## Levels: first second third fourth fifth

Numbers as Factor

  • Useful to check your data!
    • especially when you read from .csv
  • Use: class(), str()
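One reason to check is sketched below, using a hypothetical numeric column that ended up as a factor. Calling as.numeric() on a factor returns the internal level codes rather than the values, so the factor is converted to character first.

```r
# Hypothetical numeric column that was read in as a factor
f <- factor(c("10", "20", "10", "30"))
class(f)
## [1] "factor"

# as.numeric() alone returns the level codes, not the values
as.numeric(f)
## [1] 1 2 1 3

# Convert to character first to recover the original numbers
as.numeric(as.character(f))
## [1] 10 20 10 30
```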

Violin Plots

  • Similar to box plots
  • Also show kernel probability density of data
  • Marker for median of data
  • Box indicating interquartile range
    • like standard box plot

Violin Plots Data

  • x-axis needs to be categorical (a factor), y-axis numerical
    • check data
    • convert variable if necessary
str(diamonds)
## tibble [53,940 × 10] (S3: tbl_df/tbl/data.frame)
##  $ carat  : num [1:53940] 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
##  $ cut    : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
##  $ color  : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
##  $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
##  $ depth  : num [1:53940] 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
##  $ table  : num [1:53940] 55 61 65 58 58 57 57 55 61 61 ...
##  $ price  : int [1:53940] 326 326 327 334 335 336 336 337 337 338 ...
##  $ x      : num [1:53940] 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
##  $ y      : num [1:53940] 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
##  $ z      : num [1:53940] 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
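A short sketch of the check and the conversion. geom_violin() draws one violin per group on the x-axis, so cut can be used as it is. The second part uses the built-in mtcars data set, which is not part of these slides, purely to illustrate wrapping a numeric variable in factor().

```r
library(ggplot2)

# cut is already an (ordered) factor
is.factor(diamonds$cut)
## [1] TRUE

# mtcars$cyl is numeric; factor() turns it into one group per value
is.numeric(mtcars$cyl)
## [1] TRUE
p_cyl <- ggplot(mtcars, aes(x = factor(cyl), y = mpg)) +
  geom_violin()
p_cyl
```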

Basic Violin Plot

ggplot(diamonds) +
  geom_violin(aes(x = cut, y = carat))

Rotate Violin Plot

ggplot(diamonds) +
  geom_violin(aes(x = cut, y = carat)) +
  coord_flip()

Add Summary Statistics: Mean

ggplot(diamonds, aes(x = cut, y = carat)) +
  geom_violin() + 
  stat_summary(fun = mean, geom = "point", shape = 8, size = 5, colour = "darkblue")

Add Summary Statistics: Mean or Median

  • same code as for the mean; set fun = median to mark the median instead

Add Mean and Standard Deviation

  • use mean_sdl or a custom function such as stat
    • mean ± constant × SD
  • plot as crossbar or pointrange
stat <- function(x) {
  ave <- mean(x)
  ymin <- ave - sd(x)
  ymax <- ave + sd(x)
  return(c(y = ave, ymin = ymin, ymax = ymax))
}

Add Mean and Standard Deviation

ggplot(diamonds, aes(x = cut, y = carat)) +
  geom_violin() +
  stat_summary(fun.data = stat)
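For comparison, a sketch of the same plot with the built-in mean_sdl instead of the custom function. It assumes the Hmisc package is installed, because mean_sdl relies on it. The argument mult = 1 asks for one standard deviation on either side of the mean, and geom = "pointrange" is one of the two display options mentioned above.

```r
library(ggplot2)

# mean_sdl() returns mean +/- mult x SD
# (assumption: the Hmisc package is installed)
p_sd <- ggplot(diamonds, aes(x = cut, y = carat)) +
  geom_violin() +
  stat_summary(fun.data = mean_sdl, fun.args = list(mult = 1),
               geom = "pointrange")
p_sd
```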

Add Median and Quartile

ggplot(diamonds, aes(x = cut, y = carat)) +
  geom_violin() + 
  geom_boxplot(width = 0.1)

Change Colour

ggplot(diamonds, aes(x = cut, y = carat, colour = cut)) +
  geom_violin() +
  stat_summary(fun.data = stat)

More options

  • change fill colour
  • change legend position
    • (theme)
  • change order
  • multiple groups
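A minimal sketch of three of these options, with arbitrary choices for each: fill colour mapped to cut, the category order reversed on the x-axis, and the legend moved below the plot.

```r
library(ggplot2)

p_opts <- ggplot(diamonds, aes(x = cut, y = carat, fill = cut)) +
  geom_violin() +
  # reverse the order of the categories on the x-axis
  scale_x_discrete(limits = rev(levels(diamonds$cut))) +
  # move the legend below the plot
  theme(legend.position = "bottom")
p_opts
```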