Help for ftools

Title

FTOOLS Mata commands for factor variables

Syntax

class Factor scalar factor(varnames [, touse, verbose, method, sort_levels, count_levels, hash_ratio, save_keys])

class Factor scalar _factor(data [, integers_only, verbose, method, sort_levels, count_levels, hash_ratio, save_keys])

class Factor scalar join_factors(F1, F2 [, count_levels, save_keys, levels_as_keys])

Options Description
* string varnames names of variables that identify the factors
string touse name of dummy touse variable
note: you can also pass a vector with the obs. index (i.e. the first argument of st_data())
transmorphic matrix data matrix with the group identifiers
Advanced options:
real verbose 1 to display debug information
string method hashing method: mata, hash0, hash1, hash2; default is mata (auto-choose)
real sort_levels set to 0 under hash1 to increase speed, but the new levels will not match the order of the varlist
real count_levels set to 0 under hash0 to increase speed, but the F.counts vector will not be generated so F.panelsetup(), F.drop_obs(), and related methods will not be available
real hash_ratio size of the hash vector compared to the maximum number of keys (often num. obs.)
real save_keys set to 0 to increase speed and save memory, but the matrix F.keys with the original values of the factors won't be created
real integers_only 1 if the data is numeric and takes only integer values, 0 otherwise (unless you are sure of the former, set it to 0)
real levels_as_keys if set to 1, join_factors() will use the levels of F1 and F2 as the keys (as the data) when creating F12

Creating factor objects

(optional) First, you can declare the Factor object:

class Factor scalar F

Then, you can create a factor from one or more categorical variables:

F = factor(varnames)

If the categories are already in Mata (data = st_data(., varnames)), you can do:

F = _factor(data)

You can also combine two factors (F1 and F2):

F = join_factors(F1, F2)

Note that the above is equivalent to, but faster than:

varnames = invtokens((F1.varlist, F2.varlist))
F = factor(varnames)

If levels_as_keys==1, it is equivalent to:

F = _factor((F1.levels, F2.levels))

Properties and Methods

properties Description
real F.num_levels number of levels (distinct values) of the factor
real F.num_obs number of observations of the sample used to create the factor (c(N) if touse was empty)
real colvector F.levels levels of the factor; dimension F.num_obs x 1; range: {1, ..., F.num_levels}
transmorphic matrix F.keys values of the input varlist that correspond to the factor levels; dimension F.num_levels x 1; not created if save_keys==0; unordered if sort_levels==0
real vector F.counts frequencies of each level (in the sample set by touse); dimension F.num_levels x 1; will be empty if count_levels==0
string rowvector F.varlist name of variables used to create the factor
string rowvector F.varformats formats of the input variables
string rowvector F.varlabels labels of the input variables
string rowvector F.varvaluelabels value labels attached to the input variables
string rowvector F.vartypes types of the input variables
string rowvector F.vl value label definitions used by the input variables
string F.touse name of touse variable
real F.is_sorted 1 if the dataset is sorted by F.varlist
main methods Description
void F.store_levels(newvar) save the levels back into the dataset (using the same touse)
void F.store_keys([sort]) save the original key variables into a reduced dataset, including formatting and labels. If sort is 1, Stata will report the dataset as sorted
void F.panelsetup() compute auxiliary vectors F.info and F.p (see below); used in panel computations
ancillary methods Description
real scalar F.equals(F2) 1 if F represents the same data as F2 (i.e. if .num_obs .num_levels .levels .keys and .counts are equal)
real scalar F.nested_within(vec) 1 if the factor F is nested within the column vector vec (i.e. if any two obs. with the same factor level also have the same value of vec). For instance, it is true if the factor F represents counties and vec represents states.
void F.drop_obs(idx) update F to reflect a change in the underlying dataset, where the observations listed in the column vector idx are dropped (see example below)
void F.keep_obs(idx) equivalent to keeping only the obs. enumerated by idx and recreating F; uses .drop_obs()
void F.drop_if(vec) equivalent to dropping the obs. where vec!=0 and recreating F; uses .drop_obs()
void F.keep_if(vec) equivalent to keeping the obs. where vec!=0 and recreating F; uses .drop_obs()
real colvector F.drop_singletons() equivalent to dropping the levels that only appear once, and their corresponding observations. The colvector returned contains the observations that need to be excluded (note: see the source code for some advanced optional arguments).
real scalar F.is_id() 1 if F.counts is always 1 (i.e. if F.levels has no duplicates)
real vector F.intersect(vec) return a mask vector equal to 1 in each row where the value of vec also appears in F.keys. Also accepts the integers_only and verbose options: mask = F.intersect(y, 1, 1)
available after F.panelsetup() Description
transmorphic matrix F.sort(data) equivalent to data[F.p, .] but calls F.panelsetup() if required; data is a transmorphic matrix
transmorphic matrix F.invsort(data) equivalent to data[invorder(F.p), .], so it undoes a previous sort operation. Note that F.invsort(F.sort(x))==x. After its first use, it also fills the vector F.inv_p = invorder(F.p) so the operation can be repeated cheaply.
void F._sort(data) in-place version of .sort(); slower but uses less memory, as it's based on _collate()
real matrix F.info equivalent to panelsetup() (returns a (num_levels x 2) matrix with the start and end positions of each level/panel).
note: instead of using F.info directly, use panelsubmatrix(): x = panelsubmatrix(X, i, F.info) and panelsum() (see example at the end)
real vector F.p equivalent to order(F.levels) but implemented with a counting sort that is asymptotically faster (O(N) instead of O(N log N)).
note: do not use F.p directly, as it will be missing if the data is already sorted by the varnames.
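
As an illustration of the properties and methods listed above, the following sketch sorts a variable by a factor, undoes the sort, and sums it within each level. It uses only members documented in the table plus Mata's built-in panelsum(); the choice of the auto dataset and of the variables turn and price is arbitrary, and the ftools Mata library is assumed to be compiled already.

```mata
sysuse auto, clear
mata:
F = factor("turn")
F.panelsetup()                      // fills F.info and F.p
price = st_data(., "price")
sorted = F.sort(price)              // price grouped by level of turn
assert(F.invsort(sorted) == price)  // invsort() undoes sort()
totals = panelsum(sorted, F.info)   // one row per level
assert(rows(totals) == F.num_levels)
F.keys, F.counts, totals
end
```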

Notes:

- If you just downloaded the package and want to use the Mata functions directly (instead of the Stata commands), run . ftools once, which creates the Mata library if needed.
- To force compilation of the Mata library, type . ftools, compile
- F.extra is an undocumented asarray that can be used to store additional information: asarray(F.extra, "lorem", "ipsum"); and retrieve it: ipsum = asarray(F.extra, "lorem")
- join_factors() is particularly fast if the dataset is sorted in the same order as the factors
- factor() will call join_factors() if appropriate (2+ integer variables; 10,000+ obs; and method=hash1)

Description

The Factor object is a key component of several commands that manipulate data without having to sort it beforehand:

- fcollapse (alternative to collapse, contract, collapse+merge and some egen functions)

- fegen group

- fisid

- join and fmerge (alternative to m:1 and 1:1 merges)

- flevelsof (plug-in alternative to levelsof)

- fsort (note: this is O(N) but with a high constant term)

- freshape

Ancillary commands include:

- local_inlist return local inlist based on a variable and a list of values or labels

The Factor object rearranges one or more categorical variables into a new variable that takes values from 1 to F.num_levels. You can then efficiently sort any other variable by this new variable, in order to compute group statistics and perform other manipulations.

For technical information, see [1] [2], and to a lesser degree [3].

Usage

If you only want to create identifiers based on one or more variables, run something like:

{inp None}
sysuse auto, clear
mata: F = factor("foreign turn")
mata: F.store_levels("id")
mata: mata drop F
{txt None}

More complex scenarios would involve some of the following:

{inp None}
sysuse auto, clear
* Create factors for foreign data only
mata: F = factor("turn", "foreign")
* Report number of levels, obs. in sample, and keys
mata: F.num_levels
mata: F.num_obs
mata: F.keys, F.counts
* View new levels
mata: F.levels[1::10]
* Store back new levels (on the same sample)
mata: F.store_levels("id")
* Verify that the results are correct
sort id
li turn foreign id in 1/10
{txt None}

Example: operating on levels of each factor

This example shows how to process data for each level of the factor (like bysort). It does so by combining F.sort() with panelsubmatrix().

In particular, this code runs a regression for each category of the by-variable (foreign in the call at the end):

{inp None}
clear all
mata:
real matrix reg_by_group(string depvar, string indepvars, string byvar) {
    class Factor scalar F
    real scalar i
    real matrix X, Y, x, y, betas
    F = factor(byvar)
    Y = F.sort(st_data(., depvar))
    X = F.sort(st_data(., tokens(indepvars)))
    betas = J(F.num_levels, 1 + cols(X), .)
    for (i = 1; i <= F.num_levels; i++) {
        y = panelsubmatrix(Y, i, F.info)
        x = panelsubmatrix(X, i, F.info) , J(rows(y), 1, 1)
        betas[i, .] = qrsolve(x, y)'
    }
    return(betas)
}
end
sysuse auto
mata: reg_by_group("price", "weight length", "foreign")
{txt None}

Example: Factors nested within another variable

You might be interested in knowing whether a categorical variable is nested within another, coarser variable. For instance, a variable containing months ("Jan2017") is nested within another containing years ("2017"), a variable containing counties ("Durham County, NC") is nested within another containing states ("North Carolina"), and so on.

To check for this, you can follow this example:

{inp None}
sysuse auto
gen turn10 = int(turn/10)
mata:
F = factor("turn")
F.nested_within(st_data(., "trunk")) // False
F.nested_within(st_data(., "turn")) // Trivially true
F.nested_within(st_data(., "turn10")) // True
end
{txt None}

You can also compare two factors directly:

{inp None}
mata:
F1 = factor("turn")
F2 = factor("turn10")
F1.nested_within(F2.levels) // True
end
{txt None}

Example: Updating a factor after dropping observations

If you change the underlying dataset you have to recreate the factor, which is costly. As an alternative, you can use .keep_obs() and related methods:

{inp None}
* Benchmark
sysuse auto, clear
drop if price > 4500
mata: F1 = factor("turn")
* Quickly inspect results
mata: F1.num_obs, F1.num_levels, hash1(F1.levels)

* Using F.drop_obs()
sysuse auto, clear
mata
price = st_data(., "price")
F2 = factor("turn")
idx = selectindex(price :> 4500)
F2.num_obs, F2.num_levels, hash1(F2.levels)
F2.drop_obs(idx)
F2.num_obs, F2.num_levels, hash1(F2.levels)
assert(F1.equals(F2))
end

* Using the other methods
mata
F2 = factor("turn")
idx = selectindex(price :<= 4500)
F2.keep_obs(idx)
assert(F1.equals(F2))

F2 = factor("turn")
F2.drop_if(price :> 4500)
assert(F1.equals(F2))

F2 = factor("turn")
F2.keep_if(price :<= 4500)
assert(F1.equals(F2))
end
{txt None}

Remarks

All-numeric and all-string varlists are allowed, but hybrid varlists (where some but not all variables are strings) are not possible due to Mata limitations. As a workaround, first convert the string variables to numeric (e.g. using fegen group()) and then run your intended command.

You can pass as varlist a string like "turn trunk" or a tokenized string like ("turn", "trunk").
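
A minimal sketch of the two ways of passing the varlist, assuming the auto dataset is in memory and the ftools Mata library is available. Both calls are expected to describe the same factor, which F.equals() can confirm:

```mata
sysuse auto, clear
mata:
F1 = factor("turn trunk")           // single string with both names
F2 = factor(("turn", "trunk"))      // tokenized string rowvector
assert(F1.equals(F2))
end
```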

To generate a group identifier, most commands first sort the data by a list of keys (such as gvkey, year) and then check whether the keys differ from one observation to the next. Instead, ftools exploits two insights: sorting the data is not required to create an identifier, and once an identifier is created, we can use a counting sort to sort the data in O(N) time instead of O(N log N).
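
To make the counting sort idea concrete, here is a minimal sketch in plain Mata. It is not the ftools implementation: counting_order() is a hypothetical helper written for this illustration, and it assumes that levels already takes integer values in 1..num_levels. It makes two passes over the data, so the cost grows linearly with the number of observations.

```mata
mata:
// Hypothetical helper (not part of ftools): returns a permutation p such
// that levels[p] is sorted, without comparing observations to each other
real colvector counting_order(real colvector levels, real scalar num_levels)
{
    real scalar i, N
    real colvector counts, start, p

    N = rows(levels)
    // Pass 1: count how many observations fall in each level
    counts = J(num_levels, 1, 0)
    for (i = 1; i <= N; i++) {
        counts[levels[i]] = counts[levels[i]] + 1
    }
    // start[k] = position of the first obs. of level k in the sorted data
    start = runningsum(counts) - counts :+ 1
    // Pass 2: place each observation in the next free position of its level
    p = J(N, 1, .)
    for (i = 1; i <= N; i++) {
        p[start[levels[i]]] = i
        start[levels[i]] = start[levels[i]] + 1
    }
    return(p)
}

levels = (2 \ 1 \ 3 \ 1 \ 2)
p = counting_order(levels, 3)
p'            // 2 4 1 5 3
levels[p]'    // 1 1 2 2 3
end
```

Because observations are placed in the order in which they are encountered, ties keep their original relative order.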

To create an identifier (that takes a value in {1, ..., #keys}) we first map each key (composed of one or more numbers and strings) to a unique integer. For instance, the key gvkey=123, year=2010 is assigned the integer 4268248869 with the Mata function hash1. This identifier can then be used as an index when accessing vectors, bypassing the need for sorts.
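
The mapping from keys to identifiers can be sketched with Mata's built-in hash1() and a small hash table that resolves collisions by linear probing. This is only an illustration under simplifying assumptions (numeric keys, a table twice as large as the number of rows); ids_from_keys() is a hypothetical helper and does not reproduce the ftools code.

```mata
mata:
// Hypothetical helper (not part of ftools): assigns ids 1, 2, ... to the
// distinct rows of keys, in order of first appearance
real colvector ids_from_keys(real matrix keys)
{
    real scalar i, h, N, M, num_keys
    real colvector slot_row, slot_id, ids

    N = rows(keys)
    M = 2 * N                   // table size (an arbitrary choice here)
    slot_row = J(M, 1, 0)       // row of keys stored in each slot; 0 = empty
    slot_id = J(M, 1, 0)        // identifier stored in each slot
    ids = J(N, 1, .)
    num_keys = 0
    for (i = 1; i <= N; i++) {
        h = hash1(keys[i, .], M)    // starting slot, between 1 and M
        // Linear probing: move on while the slot holds a different key
        while (slot_row[h] != 0) {
            if (keys[slot_row[h], .] == keys[i, .]) break
            h = (h == M ? 1 : h + 1)
        }
        if (slot_row[h] == 0) {     // first time this key appears
            num_keys = num_keys + 1
            slot_row[h] = i
            slot_id[h] = num_keys
        }
        ids[i] = slot_id[h]
    }
    return(ids)
}

keys = (123, 2010 \ 456, 2011 \ 123, 2010)
ids_from_keys(keys)'    // 1 2 1
end
```

No sort is involved: each row is hashed once and compared only against the keys it collides with.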

The program tries to pick the hash function that best matches the dataset and input variables. For instance, if the input variables have a small range of possible values (e.g. if they are of byte type), we select the hash0 method, which uses a (non-minimal) perfect hash but might consume a lot of memory. Otherwise, we use the hash1 method, which combines Mata's hash1 function with open addressing to build a hash table (one that is more efficient than Mata's asarray).

Using the functions from fcollapse

You can access the aggregate_*() functions so you can collapse information without resorting to Stata. Example:

{inp None}
sysuse auto, clear
mata: F = factor("turn")
mata: F.panelsetup()
mata: y = st_data(., "price")
mata: sum_y = aggregate_sum(F, F.sort(y), ., "")
mata: F.keys, F.counts, sum_y
* Benchmark
collapse (sum) price, by(turn)
list
{txt None}

The function names start with aggregate_, and they are listed in fcollapse_functions.mata.

Experimental/advanced functions

real scalar init_zigzag(F1, F2, F12, F12_1, F12_2, queue, stack, subgraph_id, verbose)

Notes:

- Given the bipartite graph formed by F1 and F2, the function returns the number of disjoint subgraphs (mobility groups)
- F12 must be set with levels_as_keys==1
- For F12_1 and F12_2, you can set save_keys==0
- The function fills three useful vectors: queue, stack and subgraph_id
- If subgraph_id==0, the id vector will not be created

Source code

ftools.mata; ftools_type_aliases.mata; ftools_main.mata; ftools_bipartite.mata; fcollapse_functions.mata

Also, the latest version is available online: "https://github.com/sergiocorreia/ftools/source"

Author

Sergio Correia

"http://scorreia.com"
sergio.correia@gmail.com

More Information


To report bugs, contribute, ask for help, etc., please see the project URL on GitHub:
"https://github.com/sergiocorreia/ftools"

Acknowledgment

This project was largely inspired by the work of Wes McKinney, Andrew Maurer and Ben Jann.