FTOOLS —
|
Mata commands for factor variables |
class Factor scalar factor(
varnames [,
touse,
verbose,
method,
sort_levels,
count_levels,
hash_ratio,
save_keys])
class Factor scalar _factor(
data [,
integers_only,
verbose,
method,
sort_levels,
count_levels,
hash_ratio,
save_keys])
class Factor scalar join_factors(
F1,
F2 [,
count_levels,
save_keys,
levels_as_keys])
Options | Description | |
* | string varnames | names of variables that identify the factors |
string touse | name of dummy touse variable | |
note: you can also pass a vector with the obs. index (i.e. the first argument of st_data() ) |
||
string data | transmorphic matrix with the group identifiers | |
Advanced options: | ||
real verbose | 1 to display debug information | |
string method | hashing method: mata, hash0, hash1, hash2; default is mata (auto-choose) | |
real sort_levels | set to 0 under hash1 to increase speed, but the new levels will not match the order of the varlist | |
real count_levels | set to 0 under hash0 to increase speed, but the F.counts vector will not be generated so F.panelsetup() , F.drop_obs() , and related methods will not be available |
|
real hash_ratio | size of the hash vector compared to the maximum number of keys (often num. obs.) | |
real save_keys | set to 0 to increase speed and save memory, but the matrix F.keys with the original values of the factors won't be created | |
string integers_only | whether data is numeric and takes only integers or not (unless you are sure of the former, set it to 0) | |
real levels_as_keys | if set to 1, join_factors() will use the levels of F1 and F2 as the keys (as the data) when creating F12 |
(optional) First, you can declare the Factor object:
class Factor scalar
F
Then, you can create a factor from one or more categorical variables:
F =
factor(
varnames)
If the categories are already in Mata (data = st_data(., varnames)), you can do:
F =
_factor(
data)
You can also combine two factors (F1 and F2):
F =
join_factors(
F1,
F2)
Note that the above is equivalent to (but faster than):
varnames = invtokens((
F1.varlist,
F2.varlist))
F =
factor(
varnames)
If levels_as_keys==1, it is equivalent to:
F =
_factor((
F1.levels,
F2.levels))
properties | Description | |
real F.num_levels
|
number of levels (distinct values) of the factor | |
real F.num_obs
|
number of observations of the sample used to create the factor (c(N) if touse was empty) |
|
real colvector F.levels
|
levels of the factor; dimension F.num_obs x 1 ; range: {1, ..., F.num_levels} | |
transmorphic matrix F.keys
|
values of the input varlist that correspond to the factor levels; dimension F.num_levels x 1 ; not created if save_keys==0; unordered if sort_levels==0 |
|
real vector F.counts
|
frequencies of each level (in the sample set by touse); dimension F.num_levels x 1 ; will be empty if count_levels==0 |
|
string rowvector F.varlist
|
name of variables used to create the factor | |
string rowvector F.varformats
|
formats of the input variables | |
string rowvector F.varlabels
|
labels of the input variables | |
string rowvector F.varvaluelabels
|
value labels attached to the input variables | |
string rowvector F.vartypes
|
types of the input variables | |
string rowvector F.vl
|
value label definitions used by the input variables | |
string F.touse
|
name of touse variable | |
string F.is_sorted
|
1 if the dataset is sorted by F.varlist
|
main methods | Description | |
void F.store_levels( newvar)
|
save the levels back into the dataset (using the same touse) | |
void F.store_keys( [sort])
|
save the original key variables into a reduced dataset, including formatting and labels. If sort is 1, Stata will report the dataset as sorted | |
void F.panelsetup()
|
compute auxiliary vectors F.info and F.p (see below); used in panel computations |
ancillary methods | Description | |
real scalar F.equals( F2)
|
1 if F represents the same data as F2 (i.e. if .num_obs, .num_levels, .levels, .keys, and .counts are equal) | |
real scalar F.nested_within(vec)
|
1 if the factor F is nested within the column vector vec (i.e. if any two obs. with the same factor level also have the same value of vec). For instance, it is true if the factor F represents counties and vec represents states. | |
void F.drop_obs( idx)
|
update F to reflect a change in the underlying dataset, where the observations listed in the column vector idx are dropped (see example below) | |
void F.keep_obs( idx)
|
equivalent to keeping only the obs. enumerated by idx and recreating F; uses .drop_obs() |
|
void F.drop_if( vec)
|
equivalent to dropping the obs. where vec!=0 and recreating F; uses .drop_obs() |
|
void F.keep_if( vec)
|
equivalent to keeping the obs. where vec!=0 and recreating F; uses .drop_obs() |
|
real colvector F.drop_singletons()
|
equivalent to dropping the levels that only appear once, and their corresponding observations. The colvector returned contains the observations that need to be excluded (note: see the source code for some advanced optional arguments). | |
real scalar F.is_id()
|
1 if F.counts is always 1 (i.e. if F.levels has no duplicates) | |
real vector F.intersect( vec)
|
return a mask vector equal to 1 for each row of vec that is also in F.keys. Also accepts the integers_only and verbose options: mask = F.intersect(y, 1, 1) |
available after F.panelsetup() | Description | |
transmorphic matrix F.sort( data)
|
equivalent to data[F.p, .] but calls F.panelsetup() if required; data is a transmorphic matrix
|
|
transmorphic matrix F.invsort( data)
|
equivalent to data[invorder(F.p), .] , so it undoes a previous sort operation. Note that F.invsort(F.sort(x))==x . After the first call, it also stores the vector F.inv_p = invorder(F.p) so the operation can be repeated cheaply. |
|
void F._sort( data)
|
in-place version of .sort() ; slower but uses less memory, as it's based on _collate()
|
|
real vector F.info
|
equivalent to panelsetup() (returns a (num_levels X 2) matrix with start and end positions of each level/panel). | |
note: instead of using F.info directly, use panelsubmatrix(): x = panelsubmatrix(X, i, F.info) and panelsum() (see example at the end) |
||
real vector F.p
|
equivalent to order(F.levels) but implemented with a counting sort that is asymptotically faster (O(N) instead of O(N log N)). |
|
note: do not use F.p directly, as it will be missing if the data is already sorted by the varnames. |
Notes:
- | If you just downloaded the package and want to use the Mata functions directly (instead of the Stata commands), run . ftools once, which creates the Mata library if needed. |
|
- | To force compilation of the Mata library, type . ftools, compile
|
|
- |
F.extra is an undocumented asarray that can be used to store additional information: asarray(F.extra, "lorem", "ipsum") ; and to retrieve it: ipsum = asarray(F.extra, "lorem")
|
|
- |
join_factors() is particularly fast if the dataset is sorted in the same order as the factors |
|
- |
factor() will call join_factors() if appropriate (2+ integer variables; 10,000+ obs; and method=hash1) |
The Factor object is a key component of several commands that manipulate data without having to sort it beforehand:
- fcollapse (alternative to collapse, contract, collapse+merge and some egen functions)
- fisid
- join and fmerge (alternative to m:1 and 1:1 merges)
- flevelsof (plug-in alternative to levelsof)
- fsort (note: this is O(N) but with a high constant term)
- freshape
Ancillary commands include:
- local_inlist (returns a local inlist based on a variable and a list of values or labels)
The Factor object rearranges one or more categorical variables into a new variable that takes values from 1 to F.num_levels. You can then efficiently sort any other variable by it, in order to compute group statistics and perform other manipulations.
For technical information, see [1] [2], and to a lesser degree [3].
If you only want to create identifiers based on one or more variables, run something like:
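A minimal sketch of this step, assuming ftools is installed, its Mata library is compiled, and Stata's bundled auto dataset is used for illustration. The variable name id is arbitrary.

```stata
sysuse auto, clear
* Build the factor from one variable and write its levels back as a new variable
mata: F = factor("turn")
mata: F.store_levels("id")
```

After this, id should take values from 1 to F.num_levels, one per distinct value of turn.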
More complex scenarios would involve some of the following:
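The sketch below inspects the properties documented earlier in this help file. It assumes the auto dataset and the default options (so F.keys and F.counts are created); output is not shown because it depends on the data.

```stata
sysuse auto, clear
mata:
F = factor("turn")
F.num_levels          // number of distinct values of turn
F.keys, F.counts      // each distinct value next to its frequency
F.levels[1::10]       // levels assigned to the first ten observations
end
```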
This example shows how to process data for each level of the factor (like bysort). It does so by combining F.sort() with panelsubmatrix().
In particular, this code runs a regression for each category of turn:
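What follows is a hedged sketch of that pattern rather than a reference implementation. The function name reg_by_level and its arguments are hypothetical; it assumes the auto dataset and uses Mata's built-in qrsolve() for the least-squares fit. Levels of turn with very few observations give rank-deficient fits, so treat the coefficients as illustrative only.

```stata
sysuse auto, clear
mata:
// Hypothetical helper (not part of ftools): one OLS fit per level of byvar
real matrix reg_by_level(string scalar depvar, string scalar indepvars, string scalar byvar)
{
    class Factor scalar F
    real scalar i
    real matrix Y, X, y, x, betas

    F = factor(byvar)
    F.panelsetup()                               // fills F.info and F.p
    Y = F.sort(st_data(., depvar))               // sort the data by level
    X = F.sort(st_data(., tokens(indepvars)))
    betas = J(F.num_levels, cols(X) + 1, .)
    for (i = 1; i <= F.num_levels; i++) {
        y = panelsubmatrix(Y, i, F.info)         // rows belonging to level i
        x = panelsubmatrix(X, i, F.info), J(rows(y), 1, 1)   // add a constant
        betas[i, .] = qrsolve(x, y)'
    }
    return(betas)
}
reg_by_level("price", "weight length", "turn")
end
```

Each row of the returned matrix holds the coefficients for one level, with the constant in the last column.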
You might be interested in knowing whether a categorical variable is nested within another, coarser variable. For instance, a variable containing months ("Jan2017") is nested within another containing years ("2017"), a variable containing counties ("Durham County, NC") is nested within another containing states ("North Carolina"), and so on.
To check for this, you can follow this example:
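A small sketch, assuming the auto dataset. The variable turn10 is created here only for illustration: it is a function of turn, so turn should be nested within it, while nothing guarantees that for trunk.

```stata
sysuse auto, clear
gen turn10 = int(turn/10)
mata:
F = factor("turn")
F.nested_within(st_data(., "turn10"))   // expected to be 1 by construction
F.nested_within(st_data(., "trunk"))    // depends on the data
end
```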
You can also compare two factors directly:
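As an illustration (assuming the auto dataset), the sketch below builds two factors from the same variable and one from a different variable, then compares them with F.equals().

```stata
sysuse auto, clear
mata:
F1 = factor("turn")
F2 = factor("turn")
F3 = factor("trunk")
F1.equals(F2)     // same input, so the factors should match
F1.equals(F3)     // different input variable
end
```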
If you change the underlying dataset you have to recreate the factor, which is costly. As an alternative, you can use .keep_obs() and related methods:
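A minimal sketch of keeping a factor in sync with the data, assuming the auto dataset and the default count_levels (F.drop_obs() relies on F.counts). The index vector is arbitrary, and st_dropobsin() is Mata's built-in function for dropping observations by number.

```stata
sysuse auto, clear
mata:
F = factor("turn")
idx = (2 \ 6 \ 8)       // observations to remove (arbitrary choice)
F.drop_obs(idx)         // update the factor in place
st_dropobsin(idx)       // drop the same rows from the dataset
F.num_obs, st_nobs()    // the two counts should agree
end
```

F.keep_obs(), F.drop_if(), and F.keep_if() follow the same pattern with a different way of selecting observations.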
All-numeric and all-string varlists are allowed, but hybrid varlists (where some but not all variables are strings) are not possible due to Mata limitations. As a workaround, first convert the string variables to numeric (e.g. using fegen group()) and then run your intended command.
You can pass as varlist a string like "turn trunk" or a tokenized string like ("turn", "trunk").
To generate a group identifier, most commands first sort the data by a list of keys (such as gvkey, year) and then check whether the keys differ from one observation to the next. Instead, ftools exploits the insights that sorting the data is not required to create an identifier, and that once an identifier is created, we can use a counting sort to sort the data in O(N) time instead of O(N log N).
To create an identifier (that takes a value in {1, ..., #keys}) we first map each key (composed of one or more numbers and strings) to a unique integer. For instance, the key gvkey=123, year=2010 is assigned the integer 4268248869 with the Mata function hash1. This identifier can then be used as an index when accessing vectors, bypassing the need for sorts.
The program tries to pick the hash function that best matches the dataset and input variables. For instance, if the input variables have a small range of possible values (e.g. if they are of byte type), we select the hash0 method, which uses (non-minimal) perfect hashing but might consume a lot of memory. Otherwise, hash1 is used, which combines Mata's hash1 function with open addressing to resolve collisions (and is more efficient than Mata's asarray).
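The counting-sort idea can be sketched in plain Mata. The function counting_order below is hypothetical and is not the ftools implementation; it only shows why, once the levels are integers in {1, ..., num_levels}, a sorting permutation can be built in two linear passes.

```stata
mata:
// Hypothetical helper (not part of ftools): returns a permutation p such that
// levels[p] is sorted, given integer levels in {1, ..., num_levels}
real colvector counting_order(real colvector levels, real scalar num_levels)
{
    real colvector counts, offset, p
    real scalar i, j, n

    n = rows(levels)
    counts = J(num_levels, 1, 0)
    for (i = 1; i <= n; i++) {
        counts[levels[i]] = counts[levels[i]] + 1    // first pass: frequencies
    }
    offset = runningsum(counts) - counts             // slots used before each level
    p = J(n, 1, .)
    for (i = 1; i <= n; i++) {                       // second pass: place each obs.
        j = levels[i]
        offset[j] = offset[j] + 1
        p[offset[j]] = i
    }
    return(p)
}
levels = (2 \ 1 \ 3 \ 1 \ 2)
p = counting_order(levels, 3)
levels[p]'     // 1 1 2 2 3
end
```

Neither pass compares keys, so the cost grows linearly with the number of observations plus the number of levels.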
You can access the aggregate_*() functions so you can collapse information without resorting to Stata. Example:
The aggregate_*() functions are listed in fcollapse_functions.mata.
real scalar init_zigzag(
F1,
F2,
F12,
F12_1,
F12_2,
queue,
stack,
subgraph_id,
verbose)
Notes:
- | Given the bipartite graph formed by F1 and F2, the function returns the number of disjoint subgraphs (mobility groups) | |
- | F12 must be set with levels_as_keys==1 | |
- | For F12_1 and F12_2, you can set save_keys==0 | |
- | The function fills three useful vectors: queue, stack and subgraph_id | |
- | If subgraph_id==0, the id vector will not be created |
ftools.mata; ftools_type_aliases.mata; ftools_main.mata; ftools_bipartite.mata; fcollapse_functions.mata
Also, the latest version is available online: "https://github.com/sergiocorreia/ftools/source"
Sergio Correia
"http://scorreia.com"
sergio.correia@gmail.com
To report bugs, contribute, ask for help, etc. please see the project URL on GitHub:
"https://github.com/sergiocorreia/ftools"
This project was largely inspired by the works of Wes McKinney, Andrew Maurer and Ben Jann.