This vignette shows how to use the tidycharts package. It contains examples of different chart types and tips for proper data visualization.
This package and vignette are created for a user who:
Bar charts should be used to show structure at one moment in time. A typical use case for a bar chart is to visualize a company's profit broken down by department. The data could be structured as follows:
data_barchart <- data.frame(
  dep = c('Services', 'Production', 'Marketing', 'Purchasing'),
  profit = c(17, 15, 2, -3),
  operational = c(9, 7, 1.5, -0.4),
  property = c(4, 4, 0.5, -0.6),
  bonus = c(4, 4, 0, -2),
  prev_year = c(10, 16, 4, -1),
  plan = c(11, 13, 2, -2.5)
)
In the example data, operational, property and bonus are parts of profit and sum up to it.
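The relationship between profit and the columns operational, property and bonus can be checked directly. The following is a minimal base R sketch that rebuilds only the relevant columns; the names parts_total and parts_match are introduced here for illustration and are not part of tidycharts.

```r
# Rebuild the profit components from the example data
data_barchart <- data.frame(
  dep = c('Services', 'Production', 'Marketing', 'Purchasing'),
  profit = c(17, 15, 2, -3),
  operational = c(9, 7, 1.5, -0.4),
  property = c(4, 4, 0.5, -0.6),
  bonus = c(4, 4, 0, -2)
)

# parts_total is an illustrative name, not a tidycharts object
parts_total <- with(data_barchart, operational + property + bonus)

# all.equal() tolerates tiny floating point differences
parts_match <- isTRUE(all.equal(parts_total, data_barchart$profit))
print(parts_match)
```

A check like this is worth running before drawing a stacked chart, because a stacked bar is only meaningful when the parts really add up to the total.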
Creating a bar chart is simple. We use the bar_chart function to do that. After calling the function, the chart is printed automatically. It can also be assigned to a variable as a one-element character vector with SVG content.
A plot should have an informative title. We can use the add_title function to add one. We can chain the commands with the pipe operator (%>%):
bar_chart(data = data_barchart, cat = 'dep', series = 'profit') %>%
  add_title('The company XYZ', 'Profit', 'in mEUR', 'by departments, 2020')
We can show the structure of each value by passing a different series argument. It can be a vector of column names. In that case, a stacked bar chart is generated.
A normalized bar plot should be used to show the proportions of parts in each category. A typical use of this kind of plot is to visualize the percentage structure of profit across different departments in a company.
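To make the idea of a percentage structure concrete, the shares can be computed by hand. This is a base R sketch under the assumption that a share is simply a part divided by the row total; the names parts and shares are illustrative, and the sketch does not claim to reproduce how tidycharts computes or draws the normalized chart.

```r
# Profit components per department, taken from the example data
data_barchart <- data.frame(
  dep = c('Services', 'Production', 'Marketing', 'Purchasing'),
  operational = c(9, 7, 1.5, -0.4),
  property = c(4, 4, 0.5, -0.6),
  bonus = c(4, 4, 0, -2)
)

# Numeric matrix of parts, one row per department
parts <- as.matrix(data_barchart[, c('operational', 'property', 'bonus')])
rownames(parts) <- as.character(data_barchart$dep)

# Divide each row by its total and express the result in percent
shares <- 100 * parts / rowSums(parts)
print(round(shares, 1))
```

The Purchasing row consists of negative parts. Dividing by the negative total still yields shares that add up to 100, but a row that mixes positive and negative parts would not have a sensible percentage structure.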
Reference values (indices) mark a benchmark on the plot. In the following example, an index line is used to show the best result of the previous year (PY).
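The best previous-year result mentioned above can be derived from the example data rather than typed in by hand. The sketch below uses base R only; ref_value and ref_dep are illustrative names, and how the value is passed to the plotting function is intentionally left out.

```r
# Previous-year profit per department, from the example data
prev_year <- c(Services = 10, Production = 16, Marketing = 4, Purchasing = -1)

# Highest previous-year result and the department that achieved it
ref_value <- max(prev_year)
ref_dep <- names(prev_year)[which.max(prev_year)]

print(ref_value)
print(ref_dep)
```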
To visualize 2 or 3 series of data that do not sum up to a meaningful total, a grouped bar chart should be used. The first series is drawn as bars in the foreground, the second as bars in the background, and the third as triangles. The style of the bars and triangles indicates the type of data, the so-called scenario.
The most typical use case for this chart is to compare the profit of different departments in a company with budget and previous-year data.
# names of columns in styles data frame are not important,
# only order of columns is
styles <- data.frame(
  style_foreground = rep('actual', 4),
  style_background = rep('previous', 4),
  style_markers = rep('plan', 4)
)

bar_chart_grouped(
  data = data_barchart,
  cat = 'dep',
  foreground = 'profit',
  background = 'prev_year',
  markers = 'plan',
  series_labels = c('PY', 'AC', 'PL'),
  styles = styles
) %>%
  add_title('The company XYZ', 'Profit', 'in mEUR', 'compared to different scenarios')
Show variance in data using relative or absolute variance plots. Define the baseline and the real values. The axis on a variance plot is styled according to the scenario of the baseline data. A relative variance plot shows the difference in percent, and an absolute variance plot shows it in base units.
Example usage: visualize the difference between two scenarios, broken down by department.
bar_chart_absolute_variance(
  data = data_barchart,
  cat = 'dep',
  baseline = 'plan',
  real = 'profit',
  y_title = 'Plan vs. actual',
  y_style = 'plan'
) %>%
  add_title('The company XYZ', 'Profit variance', 'in mEUR', 'between plan and actual')
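The numbers behind the two variance plots can be worked out by hand. The sketch below is an illustration in base R, not the package's implementation: it assumes that absolute variance is real minus baseline and that relative variance divides that difference by the absolute value of the baseline. The exact formula used by tidycharts may differ, and the column names abs_variance and rel_variance are hypothetical.

```r
# Actual profit and plan per department, from the example data
data_barchart <- data.frame(
  dep = c('Services', 'Production', 'Marketing', 'Purchasing'),
  profit = c(17, 15, 2, -3),
  plan = c(11, 13, 2, -2.5)
)

# Assumption: absolute variance = real - baseline, in base units (mEUR)
abs_variance <- data_barchart$profit - data_barchart$plan

# Assumption: relative variance = difference / |baseline|, in percent
rel_variance <- 100 * abs_variance / abs(data_barchart$plan)

variance_table <- data.frame(
  dep = data_barchart$dep,
  abs_variance = abs_variance,
  rel_variance = round(rel_variance, 1)
)
print(variance_table)
```

Dividing by the absolute value of the baseline keeps the sign of the relative variance aligned with the sign of the absolute variance when the baseline is negative, as it is for Purchasing.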
For demonstration we will use the mtcars dataset, which is built into R.
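The scatter plot call in this section refers to a data frame named scatter_data whose definition is not shown. One plausible way to prepare it, offered here purely as an assumption, is to take the relevant columns from mtcars and turn the number of cylinders into a categorical column that could serve as a color dimension.

```r
# Assumption: scatter_data is a subset of the built-in mtcars data frame;
# the original definition is not shown in this vignette
scatter_data <- mtcars[, c('hp', 'qsec', 'cyl')]

# Treat the number of cylinders as a categorical variable
scatter_data$cyl <- factor(scatter_data$cyl)

str(scatter_data)
```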
Scatter plots, also known as point plots, are used to visualize multidimensional relationships between variables. Therefore, they are extensively used in exploratory data analysis.
In a scatter plot, 2 numerical dimensions are visualized by the position of a point on the Cartesian plane.
scatter_plot(
  scatter_data,
  x = 'hp',
  y = 'qsec',
  legend_title = '',
  x_names = c('Horsepower', 'in hp'),
  y_names = c('1/4 mile time', 'in s')
) %>%
  add_title('The mtcars dataset', '', '', '')
Optionally, a categorical dimension can be added in the form of point color.
Bubble plots can visualize the same dimensions as scatter plots. In addition, a third numeric dimension can be added in the form of point size. However, there is a tradeoff between the dimensionality and size of your data and the readability of the generated plots, so be careful when using bubble charts.
Charts with vertical columns are intended to visualize time series data. Note that column width depends on the x-axis interval: the longer the interval, the wider the column. A general guideline for this kind of chart is to plot up to 24 columns. If your data has more than 24 time points, see the line chart section.
Here is what an example column chart data frame could look like:
data_time_series <- data.frame(
  time = month.abb,
  Poland = 2 + 0.5 * sin(1:12),
  Germany = 3 + sin(3:14),
  Slovakia = 2 + 2 * cos(1:12)
)
The time column consists of the three-letter abbreviations of the English month names, and the other columns contain artificial data, for example sales in different countries.
Use a basic column chart to make a simple visualization of a time series. Pass the interval parameter to change the width of the columns.
A typical task for this kind of plot could be the following: show sales from different countries over the months.
Waterfall charts can be used to visualize contribution. We need to transform the data a little before passing it to the plotting function.
Example usage: visualize the contribution of monthly sales to yearly sales.
library(dplyr)  # provides group_by(), summarise(), arrange() and mutate()

# sum sales per month, restore calendar order, then accumulate over the year
df_summarized <- data_time_series %>%
  group_by(time) %>%
  summarise(Sales = sum(Poland, Germany, Slovakia)) %>%
  arrange(match(time, month.abb)) %>%
  mutate(Sales = round(cumsum(Sales), 2))

column_chart_waterfall(df_summarized, x = 'time', series = 'Sales') %>%
  add_title('The company XYZ', 'Profit', 'in mEUR', 'cumulative, 2020')
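The same transformation can be sketched without dplyr, which makes the two steps (monthly total, then running total) easier to see. This is an illustrative base R alternative; the names monthly_sales and df_cumulative are hypothetical, and the sketch relies on the data already being in calendar order.

```r
# Example time series data, one row per month
data_time_series <- data.frame(
  time = month.abb,
  Poland = 2 + 0.5 * sin(1:12),
  Germany = 3 + sin(3:14),
  Slovakia = 2 + 2 * cos(1:12)
)

# Step 1: total sales per month across the three countries
monthly_sales <- unname(
  rowSums(data_time_series[, c('Poland', 'Germany', 'Slovakia')])
)

# Step 2: running total over the year, rounded to two decimals
df_cumulative <- data.frame(
  time = data_time_series$time,
  Sales = round(cumsum(monthly_sales), 2)
)
print(df_cumulative)
```

The last row of the running total equals the yearly total.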
Other types of column charts are available, e.g. column_chart_normalized. Similar data visualization rules apply to them as to bar charts. Feel free to explore them and see the reference page if you need help.
Line charts, like column charts, should be used to show time series data. Some line plots, however, require a more complicated data structure.
The basic line plot uses lines with markers to show the data. A typical use is to visualize several data series that do not sum up to a meaningful total, for example the market value of different companies over the years.
set.seed(123)
data_companies <- data.frame(
  time = 2010:2020,
  Alpha.inc = 25 + round(1:11 + rnorm(11)),
  Beta.inc = round(40 + rnorm(11, sd = 2)),
  Gamma.inc = round(50 + rnorm(11, sd = 5))
)

line_chart_markers(
  data_companies,
  x = 'time',
  series = c('Alpha.inc', 'Beta.inc', 'Gamma.inc'),
  series_labels = c('Alpha.inc', 'Beta.inc', 'Gamma.inc'),
  interval = 'years'
) %>%
  add_title('Some companies', 'Market value', 'in mEUR', '2010...2020')
Use a dense line plot to visualize up to 6 time series with more than one point per category on the x-axis. More advanced users are encouraged to use the line_chart_dense_custom function, where they can choose the points that will be highlighted with a value label.
The most typical example is data with a time granularity of 1 day, for example the mean daily temperature over the course of a year.
data_dense <- data.frame(
  dates = seq.Date(as.Date('2019-01-01'), as.Date('2019-12-31'), by = 1),
  Warsaw = 7 + 9 * sin((365:1 - 60) / 365 * 2 * pi) + rnorm(365, sd = 2),
  London = 8 + 5 * sin((365:1 - 55) / 365 * 2 * pi) + rnorm(365, sd = 1.2)
)

line_chart_dense(data_dense, 'dates', c('Warsaw', 'London')) %>%
  add_title('Temperature in European Cities', 'Daily mean', 'in deg. C', 'In 2019')
One may wonder which type of plot to choose: a line chart or a column chart. The answer depends on the data. If you want to visualize only one series, both line and column charts are appropriate. More differences appear as the number of series increases. If the sum of the series is meaningful, use stacked columns or, optionally, stacked lines. If not, use a line plot.
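The rule of thumb above, combined with the earlier guideline of at most 24 columns, can be written down as a small decision helper. The function choose_time_chart and its arguments are hypothetical and exist only to illustrate the reasoning; it is not part of tidycharts, and treating more than 24 time points as an automatic line chart is one possible reading of the guideline.

```r
# choose_time_chart() is a hypothetical helper, not part of tidycharts
choose_time_chart <- function(n_series, n_points, sum_is_meaningful = FALSE) {
  # Guideline from the column chart section: at most 24 columns
  if (n_points > 24) {
    return('line chart')
  }
  # A single series reads well either way
  if (n_series == 1) {
    return('line chart or column chart')
  }
  # Several series: stack them only if their sum means something
  if (sum_is_meaningful) {
    return('stacked column chart (optionally stacked lines)')
  }
  'line chart'
}

print(choose_time_chart(n_series = 1, n_points = 12))
print(choose_time_chart(n_series = 3, n_points = 12, sum_is_meaningful = TRUE))
print(choose_time_chart(n_series = 3, n_points = 12))
print(choose_time_chart(n_series = 2, n_points = 365, sum_is_meaningful = TRUE))
```

For the examples in this vignette, monthly sales by country (12 points, meaningful sum) would map to stacked columns, while market values of unrelated companies would map to a line chart.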