The following Python packages/functions are used:

from datetime import datetime as dt
import numpy as np
from functools import reduce
 

1. Introduction

This note replicates the R functions written in Imputing missing data using EM algorithm under 2019: Methods for Multivariate Data. simulate_na (renamed simulate_nan here) and impute_em will be written in Python, and the computation time of impute_em will be compared between Python and R.

2. Functions

All the details of how the algorithm works are presented here.

According to the pandas documentation, missing values in Python can be denoted by either np.nan or None.

In this note, I will use np.nan to denote missing components since we are dealing with numeric arrays.

Let’s generate the data using the same parameters as in the previous note:

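The data-generating code is not reproduced here. A minimal sketch, assuming the true \(\mu\) and \(\Sigma\) are the integer arrays printed alongside the estimates later in this note, and using hypothetical names (mu_truth, Sigma_truth, X_truth) and an assumed sample size:

np.random.seed(1024)  # hypothetical seed; the original seed is not shown
n_sample = 1000       # assumed sample size
mu_truth = np.array([1, 2, 6])
Sigma_truth = np.array([
    [118, 62, 44],
    [ 62, 49, 17],
    [ 44, 17, 21]
])
X_truth = np.random.multivariate_normal(mu_truth, Sigma_truth, size = n_sample)
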
Simulating np.nan’s

Simulating np.nan’s in a numeric array can be done using the same workflow as in simulate_na:

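Here is a minimal sketch of simulate_nan that follows this workflow: mask each entry independently with probability nan_rate, and re-draw any row that would lose all of its components. The exact body may differ from the original simulate_na:

def simulate_nan(X, nan_rate):
    '''(np.array, number) -> {str: np.array or number}
    
    Return the dictionary with four keys where:
    - Key 'X' stores a copy of X whose entries are replaced with np.nan
      independently with probability nan_rate, subject to every row
      keeping at least one observed component.
    - Key 'C' stores the boolean np.array marking the observed entries
      of 'X'.
    - Key 'nan_rate' stores the nan_rate specified.
    - Key 'nan_rate_actual' stores the realized proportion of np.nan's
      in 'X'.
    '''
    
    X_complete = X.copy()
    nr, nc = X_complete.shape
    C = np.random.random((nr, nc)) > nan_rate
    
    # Re-draw any row of C that has no observed component so that
    # every X_i keeps at least one observed entry
    empty_rows = np.where(C.sum(axis = 1) == 0)[0]
    while empty_rows.size > 0:
        C[empty_rows, :] = np.random.random((empty_rows.size, nc)) > nan_rate
        empty_rows = np.where(C.sum(axis = 1) == 0)[0]
    
    X_complete[C == False] = np.nan
    
    result = {
        'X': X_complete,
        'C': C,
        'nan_rate': nan_rate,
        'nan_rate_actual': np.isnan(X_complete).sum() / (nr * nc)
    }
    
    return result
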
As before, nan_rate is set to be \(.4\):

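Assuming the X_truth and simulate_nan sketched above, the call would be along these lines:

result_nan = simulate_nan(X_truth, nan_rate = .4)
X = result_nan['X']
C = result_nan['C']
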
X is the data with missing components.

Let’s see if every \(X_i\) has at least one observed component:

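Using the C from the simulate_nan result above, a check along these lines produces the output below:

(C.sum(axis = 1) != 0).all()  # True exactly when every row of C has an observed entry
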
## True

Yes, the array of row sums of the C matrix contains no 0's. This implies that each row of C contains at least one \(1\), i.e. every \(X_i\) has at least one observed component.

Here’s nan_rate_actual:

## 0.3933333333333333

Imputing np.nan’s

In Python, impute_em can be written as follows:

def impute_em(X, max_iter = 3000, eps = 1e-08):
    '''(np.array, int, number) -> {str: np.array or int}
    
    Precondition: max_iter >= 1 and eps > 0
    
    Return the dictionary with five keys where:
    - Key 'mu' stores the mean estimate of the imputed data.
    - Key 'Sigma' stores the covariance estimate of the imputed data.
    - Key 'X_imputed' stores the imputed data that is mutated from X using 
      the EM algorithm.
    - Key 'C' stores the np.array that specifies the original missing entries
      of X.
    - Key 'iteration' stores the number of iterations used to compute
      'X_imputed' based on the max_iter and eps specified.
    '''
    
    nr, nc = X.shape
    C = np.isnan(X) == False
    
    # Collect M_i and O_i's
    one_to_nc = np.arange(1, nc + 1, step = 1)
    M = one_to_nc * (C == False) - 1
    O = one_to_nc * C - 1
    
    # Generate Mu_0 and Sigma_0
    Mu = np.nanmean(X, axis = 0)
    observed_rows = np.where(np.isnan(sum(X.T)) == False)[0]
    S = np.cov(X[observed_rows, ].T)
    if np.isnan(S).any():
        S = np.diag(np.nanvar(X, axis = 0))
    
    # Start updating
    Mu_tilde, S_tilde = {}, {}
    X_tilde = X.copy()
    no_conv = True
    iteration = 0
    while no_conv and iteration < max_iter:
        for i in range(nr):
            S_tilde[i] = np.zeros(nc ** 2).reshape(nc, nc)
            if set(O[i, ]) != set(one_to_nc - 1): # missing component exists
                M_i, O_i = M[i, ][M[i, ] != -1], O[i, ][O[i, ] != -1]
                S_MM = S[np.ix_(M_i, M_i)]
                S_MO = S[np.ix_(M_i, O_i)]
                S_OM = S_MO.T
                S_OO = S[np.ix_(O_i, O_i)]
                Mu_tilde[i] = Mu[np.ix_(M_i)] +\
                    S_MO @ np.linalg.inv(S_OO) @\
                    (X_tilde[i, O_i] - Mu[np.ix_(O_i)])
                X_tilde[i, M_i] = Mu_tilde[i]
                S_MM_O = S_MM - S_MO @ np.linalg.inv(S_OO) @ S_OM
                S_tilde[i][np.ix_(M_i, M_i)] = S_MM_O
        Mu_new = np.mean(X_tilde, axis = 0)
        S_new = np.cov(X_tilde.T, bias = 1) +\
            reduce(np.add, S_tilde.values()) / nr
        no_conv =\
            np.linalg.norm(Mu - Mu_new) >= eps or\
            np.linalg.norm(S - S_new, ord = 2) >= eps
        Mu = Mu_new
        S = S_new
        iteration += 1
    
    result = {
        'mu': Mu,
        'Sigma': S,
        'X_imputed': X_tilde,
        'C': C,
        'iteration': iteration
    }
    
    return result

We can now impute the missing components of X:

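The call itself is not shown; with the X from above and a hypothetical name for the output, it amounts to:

result_imputed = impute_em(X)
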
The estimates are as shown:

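The four arrays below appear to be, in order, the estimated mean, the true mean, the estimated covariance, and the true covariance. With the names from the sketches above, they would be displayed as:

result_imputed['mu']
mu_truth
result_imputed['Sigma']
Sigma_truth
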
## array([1.54038092, 1.77970878, 6.47510847])
## array([1, 2, 6])
## array([[120.12963893,  68.4844652 ,  44.27110069],
##        [ 68.4844652 ,  55.16383917,  19.13314671],
##        [ 44.27110069,  19.13314671,  20.85077943]])
## array([[118,  62,  44],
##        [ 62,  49,  17],
##        [ 44,  17,  21]])

The imputation is done as follows:

## array([[-22.51504305,          nan,          nan],
##        [ -4.13801604,          nan,          nan],
##        [ -6.96117925,  -4.78845229,   1.21653198],
##        [  6.57201047,   6.0520226 ,   8.87408451],
##        [         nan,          nan,   9.17663177],
##        [         nan,   1.51178112,          nan],
##        [         nan,  -5.44848469,   4.44208904],
##        [ -4.17891721,  -3.48215875,   4.29849034],
##        [-11.29647092,   1.28621915,          nan]])
## array([[-22.51504305, -11.93399968,  -2.38998182],
##        [ -4.13801604,  -1.45747717,   4.38246185],
##        [ -6.96117925,  -4.78845229,   1.21653198],
##        [  6.57201047,   6.0520226 ,   8.87408451],
##        [  7.27634923,   4.25868765,   9.17663177],
##        [  1.20775573,   1.51178112,   6.38217986],
##        [ -6.74901264,  -5.44848469,   4.44208904],
##        [ -4.17891721,  -3.48215875,   4.29849034],
##        [-11.29647092,   1.28621915,  -0.84012076]])
## array([[-22.51504305, -11.21931542,  -0.31680441],
##        [ -4.13801604,  -3.3813311 ,   3.97850866],
##        [ -6.96117925,  -4.78845229,   1.21653198],
##        [  6.57201047,   6.0520226 ,   8.87408451],
##        [  8.07906102,   3.46018924,   9.17663177],
##        [ -2.71777693,   1.51178112,   4.17435875],
##        [ -5.25896933,  -5.44848469,   4.44208904],
##        [ -4.17891721,  -3.48215875,   4.29849034],
##        [-11.29647092,   1.28621915,   0.98402706]])

3. Comparison with the R function

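The timing code is not shown; given the dt import at the top of this note, a sketch along these lines would produce the elapsed-time string reported below:

start = dt.now()
result_imputed = impute_em(X)
end = dt.now()
str(end - start)  # elapsed wall-clock time, echoed as a string
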
It takes:

## '0:00:11.452083'

(about 11.45 seconds) to impute X in Python. Using the same X, let’s see how long the process takes in R. I am using a different name (impute_em_R) for the R function:

##             [,1]      [,2]     [,3]
##  [1,] -22.515043       NaN      NaN
##  [2,]  -4.138016       NaN      NaN
##  [3,]  -6.961179 -4.788452 1.216532
##  [4,]   6.572010  6.052023 8.874085
##  [5,]        NaN       NaN 9.176632
##  [6,]        NaN  1.511781      NaN
##  [7,]        NaN -5.448485 4.442089
##  [8,]  -4.178917 -3.482159 4.298490
##  [9,] -11.296471  1.286219      NaN

We don’t need to replace NaN’s with NA’s because is.na(NaN) returns TRUE.

Let’s first check if the two functions yield the same result:

## array([1.54038092, 1.77970878, 6.47510847])
## [1] 1.540381 1.779709 6.475108
## array([[120.12963893,  68.4844652 ,  44.27110069],
##        [ 68.4844652 ,  55.16383917,  19.13314671],
##        [ 44.27110069,  19.13314671,  20.85077943]])
##           [,1]     [,2]     [,3]
## [1,] 120.12964 68.48447 44.27110
## [2,]  68.48447 55.16384 19.13315
## [3,]  44.27110 19.13315 20.85078

They are all the same, as they should be. What about the time it took to impute in R?

## Time difference of 4.443533 secs

Interestingly, R is about 2.5 times as fast as Python for this particular process. One explanation is that in the Python impute_em, the indices corresponding to M_i and O_i are extracted in every iteration of the nested loop (the for-loop inside the while-loop), whereas in R, all of the M_i's and O_i's are extracted with lapply once before entering the loops.
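
For example, the per-row index sets could be built once before the while-loop and reused inside it, mirroring the lapply step in R. The following self-contained sketch uses a small hypothetical array for illustration:

# X_demo is a hypothetical array used only to illustrate the idea
X_demo = np.array([
    [1.0, np.nan, 3.0],
    [np.nan, 2.0, np.nan],
    [4.0, 5.0, 6.0]
])
nr, nc = X_demo.shape
C = np.isnan(X_demo) == False
one_to_nc = np.arange(1, nc + 1, step = 1)
M = one_to_nc * (C == False) - 1
O = one_to_nc * C - 1

# Done once, before entering the while-loop
M_each_row = [M[i, ][M[i, ] != -1] for i in range(nr)]
O_each_row = [O[i, ][O[i, ] != -1] for i in range(nr)]
# Each EM iteration would then reuse M_each_row[i] and O_each_row[i]
# instead of re-slicing M and O inside the for-loop.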

Session info

R session info:

## R version 3.6.1 (2019-07-05)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 18362)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_Canada.1252 
## [2] LC_CTYPE=English_Canada.1252   
## [3] LC_MONETARY=English_Canada.1252
## [4] LC_NUMERIC=C                   
## [5] LC_TIME=English_Canada.1252    
## 
## attached base packages:
## [1] stats     graphics 
## [3] grDevices utils    
## [5] datasets  methods  
## [7] base     
## 
## other attached packages:
##  [1] ggConvexHull_0.1.0
##  [2] dplyr_0.8.3       
##  [3] reticulate_1.13   
##  [4] pROC_1.15.3       
##  [5] ggrepel_0.8.1     
##  [6] ggplot2_3.2.1     
##  [7] funpark_0.2.6     
##  [8] data.table_1.12.6 
##  [9] boot_1.3-22       
## [10] rmarkdown_1.17    
## [11] magrittr_1.5      
## [12] itertools2_0.1.1  
## 
## loaded via a namespace (and not attached):
##  [1] prettydoc_0.3.1 
##  [2] tidyselect_0.2.5
##  [3] xfun_0.11       
##  [4] purrr_0.3.3     
##  [5] lattice_0.20-38 
##  [6] colorspace_1.4-1
##  [7] vctrs_0.2.0     
##  [8] htmltools_0.4.0 
##  [9] yaml_2.2.0      
## [10] utf8_1.1.4      
## [11] rlang_0.4.2     
## [12] pillar_1.4.2    
## [13] glue_1.3.1      
## [14] withr_2.1.2     
## [15] lifecycle_0.1.0 
## [16] plyr_1.8.4      
## [17] stringr_1.4.0   
## [18] munsell_0.5.0   
## [19] gtable_0.3.0    
## [20] evaluate_0.14   
## [21] labeling_0.3    
## [22] knitr_1.26      
## [23] fansi_0.4.0     
## [24] Rcpp_1.0.3      
## [25] readr_1.3.1     
## [26] scales_1.1.0    
## [27] backports_1.1.5 
## [28] jsonlite_1.6    
## [29] farver_2.0.1    
## [30] gridExtra_2.3   
## [31] png_0.1-7       
## [32] hms_0.5.2       
## [33] digest_0.6.23   
## [34] stringi_1.4.3   
## [35] grid_3.6.1      
## [36] cli_1.1.0       
## [37] tools_3.6.1     
## [38] lazyeval_0.2.2  
## [39] tibble_2.1.3    
## [40] crayon_1.3.4    
## [41] tidyr_1.0.0     
## [42] pkgconfig_2.0.3 
## [43] zeallot_0.1.0   
## [44] MASS_7.3-51.4   
## [45] ellipsis_0.3.0  
## [46] Matrix_1.2-17   
## [47] xml2_1.2.2      
## [48] assertthat_0.2.1
## [49] rstudioapi_0.10 
## [50] iterators_1.0.12
## [51] R6_2.4.1        
## [52] compiler_3.6.1

Python session info:

## -----
## matplotlib   3.1.1
## numpy        1.17.0
## scipy        1.1.0
## -----
## Python 3.7.2 (tags/v3.7.2:9a3ffc0492, Dec 23 2018, 23:09:28) [MSC v.1916 64 bit (AMD64)]
## Windows-10-10.0.18362-SP0
## 4 logical CPU cores, Intel64 Family 6 Model 78 Stepping 3, GenuineIntel
## -----
## Session information updated at 2020-01-12 22:47