The following Python packages/functions are used:
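The import list itself is not shown in this extract; based on the code below, it presumably includes at least the following:

```python
import numpy as np            # arrays, random sampling, linear algebra
from functools import reduce  # used to sum the per-row covariance corrections in impute_em
```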
1. Introduction
This note is about replicating, in Python, two R functions written in Imputing missing data using EM algorithm under 2019: Methods for Multivariate Data: simulate_na (renamed as simulate_nan here) and impute_em. The computation time of impute_em will then be compared between Python and R.
2. Functions
All the details of how the algorithm works are presented there.
According to the pandas documentation, missing values in Python are denoted as:
- NaN in numeric arrays
- None or NaN in object arrays
- NaT in datetimelike arrays
In this note, I will use np.nan to denote missing components since we are dealing with numeric arrays.
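As a quick illustration (not in the original), note that np.nan never compares equal to anything, including itself, so missing entries must be detected with np.isnan:

```python
import numpy as np

x = np.array([1.0, np.nan, 3.0])
print(np.nan == np.nan)  # False: nan is not equal to itself
print(np.isnan(x))       # detects the missing entry in position 1
```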
Let’s generate the data using the same parameters as in the previous note:
np.random.seed(1024)
mu = np.array([1, 2, 6])
Sigma = np.array([[118, 62, 44], [62, 49, 17], [44, 17, 21]])
n = 400
X_truth = np.random.multivariate_normal(mu, Sigma, n)
Simulating np.nan's

Simulating np.nan's in a numeric array can be done using the same workflow as in simulate_na:
def simulate_nan(X, nan_rate):
    '''(np.array, number) -> {str: np.array or number}

    Preconditions:
    1. np.isnan(X).any() == False
    2. 0 <= nan_rate <= 1

    Return the dictionary with four keys where:
    - Key 'X' stores a np.array where some of the entries in X
      are replaced with np.nan based on nan_rate specified.
    - Key 'C' stores a np.array where each entry is False if the
      corresponding entry in the key 'X''s np.array is np.nan, and True
      otherwise.
    - Key 'nan_rate' stores nan_rate specified.
    - Key 'nan_rate_actual' stores the actual proportion of np.nan
      in the key 'X''s np.array.
    '''
    # Create C matrix; entry is False if missing, and True if observed
    X_complete = X.copy()
    nr, nc = X_complete.shape
    C = np.random.random(nr * nc).reshape(nr, nc) > nan_rate

    # Check for which rows all components become missing
    checker = np.where(sum(C.T) == 0)[0]

    if len(checker) == 0:
        # Every X_i has at least one component that is observed,
        # which is what we want
        X_complete[C == False] = np.nan
    else:
        # Otherwise, randomly "revive" some components in such X_i's
        for index in checker:
            reviving_components = np.random.choice(
                nc,
                int(np.ceil(nc * np.random.random())),
                replace = False
            )
            C[index, np.ix_(reviving_components)] = True
        X_complete[C == False] = np.nan

    result = {
        'X': X_complete,
        'C': C,
        'nan_rate': nan_rate,
        'nan_rate_actual': np.sum(C == False) / (nr * nc)
    }

    return result
As before, nan_rate is set to be \(.4\); X is the data with missing components.
Let’s see if every \(X_i\) has at least one observed component:
## True
Yes, the array of row sums of the C matrix has no 0 among its elements. This implies each row of C contains at least one \(1\), i.e. every \(X_i\) has at least one observed component.
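The chunks that produce X and this check are not shown in the extract; they presumably look like the following sketch, which inlines a compact restatement of simulate_nan for self-containedness:

```python
import numpy as np

def simulate_nan(X, nan_rate):
    # Compact restatement of the simulate_nan defined earlier
    X_complete = X.copy()
    nr, nc = X_complete.shape
    C = np.random.random((nr, nc)) > nan_rate  # True = observed
    for index in np.where(C.sum(axis=1) == 0)[0]:
        # "Revive" one random component in any fully-missing row
        C[index, np.random.choice(nc)] = True
    X_complete[~C] = np.nan
    return {'X': X_complete, 'C': C, 'nan_rate': nan_rate,
            'nan_rate_actual': (~C).mean()}

np.random.seed(1024)
mu = np.array([1, 2, 6])
Sigma = np.array([[118, 62, 44], [62, 49, 17], [44, 17, 21]])
X_truth = np.random.multivariate_normal(mu, Sigma, 400)

result = simulate_nan(X_truth, nan_rate=.4)
X = result['X']
print((result['C'].sum(axis=1) != 0).all())  # every row keeps >= 1 observed entry
print(result['nan_rate_actual'])
```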
Here’s nan_rate_actual:
## 0.3933333333333333
Imputing np.nan's

In Python, impute_em can be written as follows:
def impute_em(X, max_iter = 3000, eps = 1e-08):
    '''(np.array, int, number) -> {str: np.array or int}

    Precondition: max_iter >= 1 and eps > 0

    Return the dictionary with five keys where:
    - Key 'mu' stores the mean estimate of the imputed data.
    - Key 'Sigma' stores the variance estimate of the imputed data.
    - Key 'X_imputed' stores the imputed data that is mutated from X using
      the EM algorithm.
    - Key 'C' stores the np.array that specifies the original missing entries
      of X.
    - Key 'iteration' stores the number of iteration used to compute
      'X_imputed' based on max_iter and eps specified.
    '''
    nr, nc = X.shape
    C = np.isnan(X) == False

    # Collect M_i and O_i's
    one_to_nc = np.arange(1, nc + 1, step = 1)
    M = one_to_nc * (C == False) - 1
    O = one_to_nc * C - 1

    # Generate Mu_0 and Sigma_0
    Mu = np.nanmean(X, axis = 0)
    observed_rows = np.where(np.isnan(sum(X.T)) == False)[0]
    S = np.cov(X[observed_rows, ].T)
    if np.isnan(S).any():
        S = np.diag(np.nanvar(X, axis = 0))

    # Start updating
    Mu_tilde, S_tilde = {}, {}
    X_tilde = X.copy()
    no_conv = True
    iteration = 0
    while no_conv and iteration < max_iter:
        for i in range(nr):
            S_tilde[i] = np.zeros(nc ** 2).reshape(nc, nc)
            if set(O[i, ]) != set(one_to_nc - 1):  # missing component exists
                M_i, O_i = M[i, ][M[i, ] != -1], O[i, ][O[i, ] != -1]
                S_MM = S[np.ix_(M_i, M_i)]
                S_MO = S[np.ix_(M_i, O_i)]
                S_OM = S_MO.T
                S_OO = S[np.ix_(O_i, O_i)]
                Mu_tilde[i] = Mu[np.ix_(M_i)] +\
                    S_MO @ np.linalg.inv(S_OO) @\
                    (X_tilde[i, O_i] - Mu[np.ix_(O_i)])
                X_tilde[i, M_i] = Mu_tilde[i]
                S_MM_O = S_MM - S_MO @ np.linalg.inv(S_OO) @ S_OM
                S_tilde[i][np.ix_(M_i, M_i)] = S_MM_O
        Mu_new = np.mean(X_tilde, axis = 0)
        S_new = np.cov(X_tilde.T, bias = 1) +\
            reduce(np.add, S_tilde.values()) / nr
        no_conv =\
            np.linalg.norm(Mu - Mu_new) >= eps or\
            np.linalg.norm(S - S_new, ord = 2) >= eps
        Mu = Mu_new
        S = S_new
        iteration += 1

    result = {
        'mu': Mu,
        'Sigma': S,
        'X_imputed': X_tilde,
        'C': C,
        'iteration': iteration
    }

    return result
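For reference, the update inside the for-loop is the E-step of the EM algorithm for a multivariate normal: partitioning \(\mu\) and \(\Sigma\) by the missing (\(M\)) and observed (\(O\)) coordinates of \(X_i\), the missing block is filled with the conditional mean

\[\tilde{x}_{i, M} = \mu_{M} + \Sigma_{MO} \Sigma_{OO}^{-1} \left( x_{i, O} - \mu_{O} \right)\]

(Mu_tilde[i] in the code), and the conditional covariance

\[\tilde{\Sigma}_{MM \cdot O} = \Sigma_{MM} - \Sigma_{MO} \Sigma_{OO}^{-1} \Sigma_{OM}\]

(S_MM_O) is accumulated into S_tilde so that the M-step covariance update accounts for the uncertainty of the imputed entries.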
We can now impute the missing components of X:
The estimates are as shown:
## array([1.54038092, 1.77970878, 6.47510847])
## array([1, 2, 6])
## array([[120.12963893, 68.4844652 , 44.27110069],
## [ 68.4844652 , 55.16383917, 19.13314671],
## [ 44.27110069, 19.13314671, 20.85077943]])
## array([[118, 62, 44],
## [ 62, 49, 17],
## [ 44, 17, 21]])
The imputation is done as follows:
## array([[-22.51504305, nan, nan],
## [ -4.13801604, nan, nan],
## [ -6.96117925, -4.78845229, 1.21653198],
## [ 6.57201047, 6.0520226 , 8.87408451],
## [ nan, nan, 9.17663177],
## [ nan, 1.51178112, nan],
## [ nan, -5.44848469, 4.44208904],
## [ -4.17891721, -3.48215875, 4.29849034],
## [-11.29647092, 1.28621915, nan]])
## array([[-22.51504305, -11.93399968, -2.38998182],
## [ -4.13801604, -1.45747717, 4.38246185],
## [ -6.96117925, -4.78845229, 1.21653198],
## [ 6.57201047, 6.0520226 , 8.87408451],
## [ 7.27634923, 4.25868765, 9.17663177],
## [ 1.20775573, 1.51178112, 6.38217986],
## [ -6.74901264, -5.44848469, 4.44208904],
## [ -4.17891721, -3.48215875, 4.29849034],
## [-11.29647092, 1.28621915, -0.84012076]])
## array([[-22.51504305, -11.21931542, -0.31680441],
## [ -4.13801604, -3.3813311 , 3.97850866],
## [ -6.96117925, -4.78845229, 1.21653198],
## [ 6.57201047, 6.0520226 , 8.87408451],
## [ 8.07906102, 3.46018924, 9.17663177],
## [ -2.71777693, 1.51178112, 4.17435875],
## [ -5.25896933, -5.44848469, 4.44208904],
## [ -4.17891721, -3.48215875, 4.29849034],
## [-11.29647092, 1.28621915, 0.98402706]])
3. Comparison with the R function
It takes:
## '0:00:11.452083'
(about 11.45 seconds) to impute X in Python. Using the same X, let's see how long the process takes in R. I am using a different name (impute_em_R) for the R function:
## [,1] [,2]
## [1,] -22.515043 NaN
## [2,] -4.138016 NaN
## [3,] -6.961179 -4.788452
## [4,] 6.572010 6.052023
## [5,] NaN NaN
## [6,] NaN 1.511781
## [7,] NaN -5.448485
## [8,] -4.178917 -3.482159
## [9,] -11.296471 1.286219
## [,3]
## [1,] NaN
## [2,] NaN
## [3,] 1.216532
## [4,] 8.874085
## [5,] 9.176632
## [6,] NaN
## [7,] 4.442089
## [8,] 4.298490
## [9,] NaN
We don’t need to replace NaN’s with NA’s because is.na(NaN) returns TRUE in R.
Let’s first check if the two functions yield the same result:
## array([1.54038092, 1.77970878, 6.47510847])
## [1] 1.540381 1.779709 6.475108
## array([[120.12963893, 68.4844652 , 44.27110069],
## [ 68.4844652 , 55.16383917, 19.13314671],
## [ 44.27110069, 19.13314671, 20.85077943]])
## [,1] [,2]
## [1,] 120.12964 68.48447
## [2,] 68.48447 55.16384
## [3,] 44.27110 19.13315
## [,3]
## [1,] 44.27110
## [2,] 19.13315
## [3,] 20.85078
They are all the same, as they should be. What about the time it took to impute in R?
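As a side note, the agreement can also be verified programmatically; the values below are copied from the printed estimates above (R prints 7 significant digits):

```python
import numpy as np

mu_py = np.array([1.54038092, 1.77970878, 6.47510847])  # Python estimate
mu_R = np.array([1.540381, 1.779709, 6.475108])         # R estimate, as printed
print(np.allclose(mu_py, mu_R, atol=1e-6))  # True
```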
## Time difference of 4.443533 secs
Interestingly, R is about 2.5 times as fast as Python for this particular process. One explanation is that in the Python impute_em, the indices corresponding to M_i and O_i are extracted in each iteration of the nested loop (the for-loop inside the while-loop), whereas in R all of the M_i's and O_i's are extracted using lapply before entering the loops.
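A possible remedy on the Python side (a sketch, not from the original) is to precompute the index sets once before the while-loop, mirroring the lapply approach:

```python
import numpy as np

X = np.array([[1.0, np.nan], [np.nan, 2.0], [3.0, 4.0]])  # toy data with np.nan's
C = ~np.isnan(X)

# Extract observed/missing column indices per row once, outside the EM loop
O_idx = [np.flatnonzero(C[i]) for i in range(X.shape[0])]
M_idx = [np.flatnonzero(~C[i]) for i in range(X.shape[0])]

# Each pass of the while-loop can then reuse O_idx[i] and M_idx[i] directly,
# instead of re-deriving them from the M and O matrices on every iteration.
print(O_idx[0], M_idx[0])  # [0] [1]
```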
Session info
R session info:
## R version 3.6.1 (2019-07-05)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 18362)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=English_Canada.1252
## [2] LC_CTYPE=English_Canada.1252
## [3] LC_MONETARY=English_Canada.1252
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_Canada.1252
##
## attached base packages:
## [1] stats graphics
## [3] grDevices utils
## [5] datasets methods
## [7] base
##
## other attached packages:
## [1] ggConvexHull_0.1.0
## [2] dplyr_0.8.3
## [3] reticulate_1.13
## [4] pROC_1.15.3
## [5] ggrepel_0.8.1
## [6] ggplot2_3.2.1
## [7] funpark_0.2.6
## [8] data.table_1.12.6
## [9] boot_1.3-22
## [10] rmarkdown_1.17
## [11] magrittr_1.5
## [12] itertools2_0.1.1
##
## loaded via a namespace (and not attached):
## [1] prettydoc_0.3.1
## [2] tidyselect_0.2.5
## [3] xfun_0.11
## [4] purrr_0.3.3
## [5] lattice_0.20-38
## [6] colorspace_1.4-1
## [7] vctrs_0.2.0
## [8] htmltools_0.4.0
## [9] yaml_2.2.0
## [10] utf8_1.1.4
## [11] rlang_0.4.2
## [12] pillar_1.4.2
## [13] glue_1.3.1
## [14] withr_2.1.2
## [15] lifecycle_0.1.0
## [16] plyr_1.8.4
## [17] stringr_1.4.0
## [18] munsell_0.5.0
## [19] gtable_0.3.0
## [20] evaluate_0.14
## [21] labeling_0.3
## [22] knitr_1.26
## [23] fansi_0.4.0
## [24] Rcpp_1.0.3
## [25] readr_1.3.1
## [26] scales_1.1.0
## [27] backports_1.1.5
## [28] jsonlite_1.6
## [29] farver_2.0.1
## [30] gridExtra_2.3
## [31] png_0.1-7
## [32] hms_0.5.2
## [33] digest_0.6.23
## [34] stringi_1.4.3
## [35] grid_3.6.1
## [36] cli_1.1.0
## [37] tools_3.6.1
## [38] lazyeval_0.2.2
## [39] tibble_2.1.3
## [40] crayon_1.3.4
## [41] tidyr_1.0.0
## [42] pkgconfig_2.0.3
## [43] zeallot_0.1.0
## [44] MASS_7.3-51.4
## [45] ellipsis_0.3.0
## [46] Matrix_1.2-17
## [47] xml2_1.2.2
## [48] assertthat_0.2.1
## [49] rstudioapi_0.10
## [50] iterators_1.0.12
## [51] R6_2.4.1
## [52] compiler_3.6.1
Python session info:
## -----
## matplotlib 3.1.1
## numpy 1.17.0
## scipy 1.1.0
## -----
## Python 3.7.2 (tags/v3.7.2:9a3ffc0492, Dec 23 2018, 23:09:28) [MSC v.1916 64 bit (AMD64)]
## Windows-10-10.0.18362-SP0
## 4 logical CPU cores, Intel64 Family 6 Model 78 Stepping 3, GenuineIntel
## -----
## Session information updated at 2020-01-12 22:47