# Introduction

This package provides mechanisms for classifying border crossings using a rules-based methodology. The goal of this classification is to identify potential long-term migrants. A long-term migration is defined as a border crossing involving a change in residence status: a crossing counts as a long-term migration to or from a country if it entails a change from non-residence to residence, or from residence to non-residence. The rules-based classification used to determine a long-term migration is defined by a threshold duration and a test duration (also called the window size). Under a 12/16 rule, for instance, the threshold duration is 12 months and the test duration (window size) is 16 months; under a 9/12 rule, the threshold duration is 9 months and the test duration is 12 months. This vignette demonstrates the use of the package's functions, calculating days of residence and migration statuses for a small made-up set of border crossings. For more information about the methodology, see Stats NZ (2020): https://www.stats.govt.nz/methods/defining-migrants-using-travel-histories-and-the-1216-month-rule.

# Outcome-based rules for classifying migration

Rules-based classification simultaneously assigns residence status and migration status, based on the direction of the border crossing, the person's residence status before the crossing, and the length of time spent in or out of the country after the crossing. The outcome-based rules therefore deal with two sorts of units at once, border crossings and people, so the input data must contain information about both. The following outlines the main table structure of the input data.

### The format of input data (Border Crossing Information data):

| Column | Type | Description |
|---|---|---|
| journeyId | Integer | The id of a journey |
| personId | Integer | The id of a person |
| is_arrival | Integer | The direction of the border crossing: 0 = departure, 1 = arrival |
| date_crossing | Character | The date of the border crossing, e.g., 2012-01-01 |
| journey_sequence | Integer | The order of multiple journeys made by the person |
| journeyId_prev | Integer | The id of the previous journey |
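A minimal crossing table can be constructed directly in R. The values below are made up for illustration, and using `NA` for `journeyId_prev` on a person's first journey is an assumption, not something the package documentation here specifies:

```r
## Two made-up crossings for one person (values are illustrative only)
crossing_data <- data.frame(
  journeyId        = c(1L, 2L),
  personId         = c(1L, 1L),
  is_arrival       = c(1L, 0L),              ## 1 = arrival, 0 = departure
  date_crossing    = c("2012-01-01", "2012-06-01"),
  journey_sequence = c(1L, 2L),
  journeyId_prev   = c(NA, 1L)               ## assumed NA for the first journey
)
```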

Sometimes, data about a person's initial residence status is also required. The initial residence status is the residence status the person held before making their first journey in the target country, i.e., New Zealand.

### The format of initial residence status data:

| Column | Type | Description |
|---|---|---|
| personId | Integer | The id of a person |
| res_status_initial | Integer | The initial residence status: 0 = non-resident, 1 = resident |
| date_finalised | Character | The date the residence status was confirmed, e.g., 2010-01-01 |
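An initial residence status table in the same shape might be built as follows (made-up values):

```r
## One person who was already a resident before their first recorded journey
init_res_status_data <- data.frame(
  personId           = 1L,
  res_status_initial = 1L,            ## 1 = resident, 0 = non-resident
  date_finalised     = "2010-01-01"
)
```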

The remainder of this document outlines how this comprehensive rules-based methodology is applied in practice.

# Using the migrbc Package

The function ‘run_rbc’ is the main function of the package ‘migrbc’; it classifies migrants based on the rules defined in the algorithm section. The following table lists its parameters.

### Parameters of the Function run_rbc:

| Parameter | Description |
|---|---|
| crossing_data | Pre-processed grouped data containing journeys, movements and initial residence status (IRS), or the raw crossing data. |
| init_res_status_data | The raw initial residence status data, as a data frame. |
| window_size | The maximum length of the scanning period. Can be an integer giving the number of days, the result of a call to the function difftime, or an object of class Duration. |
| threshold_year | The length of the yearly test period. Can be an integer giving the number of days, the result of a call to the function difftime, or an object of class Duration. |
| parallel | Logical. Whether to use parallel processing to speed up the calculation of migration statuses. Defaults to TRUE. |
| n_core | The number of cores to use if parallel is TRUE. Defaults to 2. Higher values typically give faster calculations on computers with more than two cores. |
| max_ram | Optional. Limits the RAM that this function may use. The default value is 5 Gb. |
| include_error_columns | Optional. If TRUE, the returned error_data will contain two extra columns, error_code and error_message. |
| mc.cleanup | Optional. If TRUE, all child processes forked by this function are killed (by sending SIGTERM) before the function returns. Under normal circumstances mclapply waits for the children to deliver results, so this option usually only has an effect when mclapply is interrupted. If FALSE, child processes are collected but not forcefully terminated. As a special case, this argument can be set to the number of the signal that should be used to kill the children instead of SIGTERM. |

The parameter crossing_data can be a data frame object or the output of the function pre_process. The function pre_process divides a large dataset into a number of sub-datasets. It is useful when the raw data is extremely large; for example, with more than 1 million records it is better to divide the raw data into smaller subsets (e.g., n_groups = 10 yields 10 subgroups). The following lists the parameters of the function pre_process.

### Parameters of the Function pre_process:

| Parameter | Description |
|---|---|
| data | A data frame object. |
| init_res_status_data | The raw initial residence status data, as a data frame. |
| n_groups | The approximate number of sub-datasets to return. |

In this document, we provide a few examples that outline the main functionality of the package. The first step is to attach the packages migrbc, knitr and parallel in your R session as follows:
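The original vignette code chunk is not reproduced here; attaching the packages looks like this (knitr and parallel must be installed):

```r
library(migrbc)
library(knitr)
library(parallel)
```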

## Setting up a Logger (Optional)

We use futile.logger for troubleshooting and provide a utility function for initializing the logger. More information about the logger can be found at https://CRAN.R-project.org/package=futile.logger. The parameter log_level is a number from the set {1, 2, 4, 6, 8, 9}:

- futile.logger::FATAL: 1
- futile.logger::ERROR: 2
- futile.logger::WARN: 4
- futile.logger::INFO: 6
- futile.logger::DEBUG: 8
- futile.logger::TRACE: 9

To write the log information to a file, set the parameter log_path to a file path of your choice. By default, log_level is set to 1 and log_path is set to NULL.
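As an illustration using futile.logger directly, rather than the package's own utility function, the threshold and file appender can be configured like this (a sketch; the file name is arbitrary):

```r
library(futile.logger)

## Log at INFO level and above
flog.threshold(futile.logger::INFO)

## Send log output to a file instead of the console
flog.appender(appender.file("migrbc.log"))

flog.info("Starting rules-based classification")
```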

## Creating Dummy Data

We use some random made-up data generated by the function setup_random_test_data in the migrbc package. The first argument is the number of people to generate; the second argument is the initial start date; the third argument is the number of journeys per person; min and max define the random duration gap between two consecutive crossing dates.
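For example, mirroring the call shown in the appendix but for ten people:

```r
## Generate made-up border crossings for 10 people, 3 journeys each
person_data <- migrbc::setup_random_test_data(
  10,                          ## number of people
  initial_date = "2001-01-01", ## initial start date
  numJourneys = 3,             ## journeys per person
  min = 0, max = 100           ## random gap between crossing dates
)
head(person_data)
```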

## Defining Threshold and Test Duration

In this vignette, we use a 12/16 rule. In other words, the threshold period is 12 months and the test period (window size) is 16 months, approximately. The "approximately" is necessary because in practice we define the threshold and test periods using days rather than months; months are difficult to work with, since their lengths vary in complicated ways. The package therefore takes these parameters as numbers of days: instead of 12 months we use 365 days, and instead of 16 months we use 487 days.
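In code, the 12/16 rule expressed in days is simply:

```r
## Approximate a 12/16 rule in days
threshold_year <- 365   ## threshold period: ~12 months
window_size    <- 487   ## test period: ~16 months (16 * 365.25 / 12 = 487)
```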

Alternatively, we can load the journeys from a CSV file (not shipped with ‘migrbc’) that contains all the required columns:

| journeyId | personId | is_arrival | date_crossing | journey_sequence |
|---|---|---|---|---|
| 1 | 1 | 1 | 2001-01-01 | 1 |
| 2 | 1 | 0 | 2001-01-06 | 2 |
| 3 | 1 | 1 | 2001-01-14 | 3 |
| 4 | 1 | 0 | 2001-02-04 | 4 |
| 5 | 1 | 1 | 2001-02-27 | 5 |
| 6 | 1 | 0 | 2001-04-11 | 6 |
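Reading such a file uses base R only; the file name below is hypothetical:

```r
## Hypothetical path; point this at your own CSV of crossings
person_data <- read.csv("journeys.csv", stringsAsFactors = FALSE)
```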

## Pre-process RBC

As mentioned at the beginning of this section, this step is optional: it divides the input data into small subgroups, which saves memory for very large datasets (see Case 1-1). If we choose not to divide the data into a custom number of subgroups, we can pass person_data directly to the main RBC function (see Case 1-2).
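A sketch of the optional pre-processing step, using the parameters documented above and assuming init_res_status_data may be left NULL when no initial statuses are available:

```r
## Split the raw crossings into 10 subgroups before classification
pre_processed <- migrbc::pre_process(
  person_data,
  init_res_status_data = NULL,  ## assumed optional; supply a data frame if known
  n_groups = 10
)
```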

## Process RBC

### Case 1-1: classifying rules with divided sub-datasets (pre-processed)

The main function run_rbc can also process raw data (a data frame object) directly if there is no need to create your own number of subgroups. In this case, run_rbc divides the raw data with n_groups = n_core.
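The classification call itself, mirroring the one in the appendix and run here without parallelism:

```r
## Classify under the 12/16 rule (365-day threshold, 487-day window)
post_data <- migrbc::run_rbc(
  person_data,           ## raw crossings, or the output of pre_process
  window_size = 487,
  threshold_year = 365,
  parallel = FALSE
)
```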

## Post-process

The result returned from the function run_rbc (migrbc) contains two lists of person records: those processed without error and those whose records raised errors.
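Assuming the result components are named as elsewhere in this vignette (journeys for the successfully classified records, error_data for the failures):

```r
## Successfully classified journeys
head(post_data$journeys)

## Records that could not be classified
head(post_data$error_data)
```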

| journeyId | personId | res_status_before | res_status_after | is_long_term_mig |
|---|---|---|---|---|
| 1 | 1 | 0 | 0 | 0 |
| 2 | 1 | 0 | 0 | 0 |
| 3 | 1 | 0 | 0 | 0 |
| 4 | 1 | 0 | 0 | 0 |
| 5 | 1 | 0 | 0 | 0 |
| 6 | 1 | 0 | 0 | 0 |

## Including error columns

### Case 2: 12/16 small made up data with error

By default, the option ‘include_error_columns’ is FALSE. To view the columns ‘error_code’ and ‘error_message’ in the returned ‘error_data’, we need to set ‘include_error_columns’ to TRUE.

Journeys classified without error:

| journeyId | personId | res_status_before | res_status_after | is_long_term_mig |
|---|---|---|---|---|
| 3 | 2 | 0 | 0 | 0 |
| 4 | 2 | 0 | 0 | 0 |

Journeys with errors:

| journeyId | personId | error_code | error_message |
|---|---|---|---|
| 1 | 1 | 2 | ‘is_arrival’ cannot be identical to previous one. |
| 2 | 1 | 2 | ‘is_arrival’ cannot be identical to previous one. |

# Summary

The goal of this document is to provide comprehensive examples that guide users in applying the package effectively and efficiently. Some background information and the main algorithm have also been introduced. All information provided belongs to Stats NZ.

# Appendix - Plots of migration histories

Package migrbc contains a function called plot_mig_hist for plotting migration histories. A plot for one randomly generated person is produced as follows:

```r
plot_test <- function(mig_hist) {
  plot_mig_hist(date_crossing = as.character(mig_hist$date_crossing),
                is_arrival = mig_hist$is_arrival,
                days_to_next_crossing = mig_hist$days_to_next_crossing,
                show_date = FALSE,
                cex = 0.8)
}

number_of_people <- 1
person_data <- migrbc::setup_random_test_data(number_of_people,
                                              initial_date = "2001-01-01",
                                              numJourneys = 3,
                                              min = 0, max = 100)

## run in non-parallel
post_data <- migrbc::run_rbc(person_data,
                             window_size = 487,
                             threshold_year = 365,
                             parallel = FALSE)

## plot good result
old_par <- par(mfrow = c(1, 1))
plot_test(post_data$journeys)
par(old_par)
```