The following Python packages/functions are used:

from datetime import datetime as dt
import numpy as np
from functools import reduce

1. Introduction
2. Functions
- Simulating np.nan's
- Imputing np.nan's
3. Comparison with the R function
Session info
Related pages

1. Introduction

This note is about replicating R functions written in Imputing missing data using EM algorithm under 2019: Methods for Multivariate Data. simulate_na (which will be renamed as simulate_nan here) and impute_em are going to be written in Python, and the computation time of impute_em will be checked in both Python and R.

2. Functions

All the details of how the algorithm works is presented here.

According to the following pandas documentation, missing values in Python are denoted as:

NaN in numeric arrays
None or NaN in object arrays
NaT in datetimelike

In this note, I will use np.nan to denote missing components since we are dealing with numeric arrays.

Let’s generate the data using the same parameters as in the previous note:

np.random.seed(1024)
mu = np.array([1, 2, 6])
Sigma = np.array([[118, 62, 44], [62, 49, 17], [44, 17, 21]])
n = 400
X_truth = np.random.multivariate_normal(mu, Sigma, n)

Simulating `np.nan`’s

Simulating np.nan’s in a numeric array can be done using the same workflow as in simulate_na:

def simulate_nan(X, nan_rate):
    '''(np.array, number) -> {str: np.array or number}
    
    Preconditions:
    1. np.isnan(X_complete).any() == False
    2. 0 <= nan_rate <= 1
    
    Return the dictionary with four keys where: 
    - Key 'X' stores a np.array where some of the entries in X 
      are replaced with np.nan based on nan_rate specified.
    - Key 'C' stores a np.array where each entry is False if the
      corresponding entry in the key 'X''s np.array is np.nan, and True
      otherwise.
    - Key 'nan_rate' stores nan_rate specified.
    - Key 'nan_rate_actual' stores the actual proportion of np.nan
      in the key 'X''s np.array.
    '''
    
    # Create C matrix; entry is False if missing, and True if observed
    X_complete = X.copy()
    nr, nc = X_complete.shape
    C = np.random.random(nr * nc).reshape(nr, nc) > nan_rate
    
    # Check for which i's we have all components become missing
    checker = np.where(sum(C.T) == 0)[0]
    if len(checker) == 0:
        # Every X_i has at least one component that is observed,
        # which is what we want
        X_complete[C == False] = np.nan
    else:
        # Otherwise, randomly "revive" some components in such X_i's
        for index in checker:
            reviving_components = np.random.choice(
                nc, 
                int(np.ceil(nc * np.random.random())), 
                replace = False
            )
            C[index, np.ix_(reviving_components)] = True
        X_complete[C == False] = np.nan
    
    result = {
        'X': X_complete,
        'C': C,
        'nan_rate': nan_rate,
        'nan_rate_actual': np.sum(C == False) / (nr * nc)
    }
    
    return result

As before, nan_rate is set to be \(.4\):

result = simulate_nan(X_truth, nan_rate = .4)
X = result['X'].copy()

X is the data with missing components.

Let’s see if every \(X_i\) has at least one observed component:

(sum((np.isnan(X) == False).T) == 0).any() == False

## True

Yes, the array of rowsums of C matrix has no 0 in its element. This implies each row of C contains at least one \(1\), i.e. every \(X_i\) has at least one observed component.

Here’s nan_rate_actual:

result['nan_rate_actual']

## 0.3933333333333333

Imputing `np.nan`’s

In Python, impute_em can be written as follows:

def impute_em(X, max_iter = 3000, eps = 1e-08):
    '''(np.array, int, number) -> {str: np.array or int}
    
    Precondition: max_iter >= 1 and eps > 0
    
    Return the dictionary with five keys where:
    - Key 'mu' stores the mean estimate of the imputed data.
    - Key 'Sigma' stores the variance estimate of the imputed data.
    - Key 'X_imputed' stores the imputed data that is mutated from X using 
      the EM algorithm.
    - Key 'C' stores the np.array that specifies the original missing entries
      of X.
    - Key 'iteration' stores the number of iteration used to compute
      'X_imputed' based on max_iter and eps specified.
    '''
    
    nr, nc = X.shape
    C = np.isnan(X) == False
    
    # Collect M_i and O_i's
    one_to_nc = np.arange(1, nc + 1, step = 1)
    M = one_to_nc * (C == False) - 1
    O = one_to_nc * C - 1
    
    # Generate Mu_0 and Sigma_0
    Mu = np.nanmean(X, axis = 0)
    observed_rows = np.where(np.isnan(sum(X.T)) == False)[0]
    S = np.cov(X[observed_rows, ].T)
    if np.isnan(S).any():
        S = np.diag(np.nanvar(X, axis = 0))
    
    # Start updating
    Mu_tilde, S_tilde = {}, {}
    X_tilde = X.copy()
    no_conv = True
    iteration = 0
    while no_conv and iteration < max_iter:
        for i in range(nr):
            S_tilde[i] = np.zeros(nc ** 2).reshape(nc, nc)
            if set(O[i, ]) != set(one_to_nc - 1): # missing component exists
                M_i, O_i = M[i, ][M[i, ] != -1], O[i, ][O[i, ] != -1]
                S_MM = S[np.ix_(M_i, M_i)]
                S_MO = S[np.ix_(M_i, O_i)]
                S_OM = S_MO.T
                S_OO = S[np.ix_(O_i, O_i)]
                Mu_tilde[i] = Mu[np.ix_(M_i)] +\
                    S_MO @ np.linalg.inv(S_OO) @\
                    (X_tilde[i, O_i] - Mu[np.ix_(O_i)])
                X_tilde[i, M_i] = Mu_tilde[i]
                S_MM_O = S_MM - S_MO @ np.linalg.inv(S_OO) @ S_OM
                S_tilde[i][np.ix_(M_i, M_i)] = S_MM_O
        Mu_new = np.mean(X_tilde, axis = 0)
        S_new = np.cov(X_tilde.T, bias = 1) +\
            reduce(np.add, S_tilde.values()) / nr
        no_conv =\
            np.linalg.norm(Mu - Mu_new) >= eps or\
            np.linalg.norm(S - S_new, ord = 2) >= eps
        Mu = Mu_new
        S = S_new
        iteration += 1
    
    result = {
        'mu': Mu,
        'Sigma': S,
        'X_imputed': X_tilde,
        'C': C,
        'iteration': iteration
    }
    
    return result

We can now impute the missing components of X:

start = dt.now()
result_imputed = impute_em(X)
end = dt.now()

The estimates are as shown:

result_imputed['mu'] # estimate using the imputed data

## array([1.54038092, 1.77970878, 6.47510847])

mu # truth

## array([1, 2, 6])

result_imputed['Sigma'] # estimate using the imputed data

## array([[120.12963893,  68.4844652 ,  44.27110069],
##        [ 68.4844652 ,  55.16383917,  19.13314671],
##        [ 44.27110069,  19.13314671,  20.85077943]])

Sigma # truth

## array([[118,  62,  44],
##        [ 62,  49,  17],
##        [ 44,  17,  21]])

The imputation is done as follows:

X[np.arange(0, 9), ] # data with missing components

## array([[-22.51504305,          nan,          nan],
##        [ -4.13801604,          nan,          nan],
##        [ -6.96117925,  -4.78845229,   1.21653198],
##        [  6.57201047,   6.0520226 ,   8.87408451],
##        [         nan,          nan,   9.17663177],
##        [         nan,   1.51178112,          nan],
##        [         nan,  -5.44848469,   4.44208904],
##        [ -4.17891721,  -3.48215875,   4.29849034],
##        [-11.29647092,   1.28621915,          nan]])

result_imputed['X_imputed'][np.arange(0, 9), ] # imputed data

## array([[-22.51504305, -11.93399968,  -2.38998182],
##        [ -4.13801604,  -1.45747717,   4.38246185],
##        [ -6.96117925,  -4.78845229,   1.21653198],
##        [  6.57201047,   6.0520226 ,   8.87408451],
##        [  7.27634923,   4.25868765,   9.17663177],
##        [  1.20775573,   1.51178112,   6.38217986],
##        [ -6.74901264,  -5.44848469,   4.44208904],
##        [ -4.17891721,  -3.48215875,   4.29849034],
##        [-11.29647092,   1.28621915,  -0.84012076]])

X_truth[np.arange(0, 9), ] # truth

## array([[-22.51504305, -11.21931542,  -0.31680441],
##        [ -4.13801604,  -3.3813311 ,   3.97850866],
##        [ -6.96117925,  -4.78845229,   1.21653198],
##        [  6.57201047,   6.0520226 ,   8.87408451],
##        [  8.07906102,   3.46018924,   9.17663177],
##        [ -2.71777693,   1.51178112,   4.17435875],
##        [ -5.25896933,  -5.44848469,   4.44208904],
##        [ -4.17891721,  -3.48215875,   4.29849034],
##        [-11.29647092,   1.28621915,   0.98402706]])

3. Comparison with the R function

It takes:

str(end - start)

## '0:00:11.452083'

seconds to impute X in Python. Using the same X, let’s see how long the process takes in R. I am using a different name (impute_em_R) for the R function:

X <- py$X
head(X, n = 9)

##             [,1]      [,2]
##  [1,] -22.515043       NaN
##  [2,]  -4.138016       NaN
##  [3,]  -6.961179 -4.788452
##  [4,]   6.572010  6.052023
##  [5,]        NaN       NaN
##  [6,]        NaN  1.511781
##  [7,]        NaN -5.448485
##  [8,]  -4.178917 -3.482159
##  [9,] -11.296471  1.286219
##           [,3]
##  [1,]      NaN
##  [2,]      NaN
##  [3,] 1.216532
##  [4,] 8.874085
##  [5,] 9.176632
##  [6,]      NaN
##  [7,] 4.442089
##  [8,] 4.298490
##  [9,]      NaN

We don’t need to replace NaN’s with NA’s because is.na(NaN) returns TRUE.

start_R <- Sys.time()
result_imputed_R <- impute_em_R(X)
end_R <- Sys.time()

Let’s first check if two functions yield the same result:

result_imputed['mu']

## array([1.54038092, 1.77970878, 6.47510847])

result_imputed_R$mu

## [1] 1.540381 1.779709 6.475108

result_imputed['Sigma']

## array([[120.12963893,  68.4844652 ,  44.27110069],
##        [ 68.4844652 ,  55.16383917,  19.13314671],
##        [ 44.27110069,  19.13314671,  20.85077943]])

result_imputed_R$Sigma

##           [,1]     [,2]
## [1,] 120.12964 68.48447
## [2,]  68.48447 55.16384
## [3,]  44.27110 19.13315
##          [,3]
## [1,] 44.27110
## [2,] 19.13315
## [3,] 20.85078

They are all the same, as they should be. What about the time it took to impute in R?

end_R - start_R

## Time difference of 4.443533 secs

Interestingly, R is about 2.5 times as fast as Python for this particular process. One explanation is that in impute_em written in Python, indices corresponding to M_i and O_i are extracted in each iteration of nested loop (for-loop inside the while-loop), whereas all of M_i and O_i are extracted using lapply in R before going over the loops.

Session info

R session info:

sessionInfo()

## R version 3.6.1 (2019-07-05)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 18362)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_Canada.1252 
## [2] LC_CTYPE=English_Canada.1252   
## [3] LC_MONETARY=English_Canada.1252
## [4] LC_NUMERIC=C                   
## [5] LC_TIME=English_Canada.1252    
## 
## attached base packages:
## [1] stats     graphics 
## [3] grDevices utils    
## [5] datasets  methods  
## [7] base     
## 
## other attached packages:
##  [1] ggConvexHull_0.1.0
##  [2] dplyr_0.8.3       
##  [3] reticulate_1.13   
##  [4] pROC_1.15.3       
##  [5] ggrepel_0.8.1     
##  [6] ggplot2_3.2.1     
##  [7] funpark_0.2.6     
##  [8] data.table_1.12.6 
##  [9] boot_1.3-22       
## [10] rmarkdown_1.17    
## [11] magrittr_1.5      
## [12] itertools2_0.1.1  
## 
## loaded via a namespace (and not attached):
##  [1] prettydoc_0.3.1 
##  [2] tidyselect_0.2.5
##  [3] xfun_0.11       
##  [4] purrr_0.3.3     
##  [5] lattice_0.20-38 
##  [6] colorspace_1.4-1
##  [7] vctrs_0.2.0     
##  [8] htmltools_0.4.0 
##  [9] yaml_2.2.0      
## [10] utf8_1.1.4      
## [11] rlang_0.4.2     
## [12] pillar_1.4.2    
## [13] glue_1.3.1      
## [14] withr_2.1.2     
## [15] lifecycle_0.1.0 
## [16] plyr_1.8.4      
## [17] stringr_1.4.0   
## [18] munsell_0.5.0   
## [19] gtable_0.3.0    
## [20] evaluate_0.14   
## [21] labeling_0.3    
## [22] knitr_1.26      
## [23] fansi_0.4.0     
## [24] Rcpp_1.0.3      
## [25] readr_1.3.1     
## [26] scales_1.1.0    
## [27] backports_1.1.5 
## [28] jsonlite_1.6    
## [29] farver_2.0.1    
## [30] gridExtra_2.3   
## [31] png_0.1-7       
## [32] hms_0.5.2       
## [33] digest_0.6.23   
## [34] stringi_1.4.3   
## [35] grid_3.6.1      
## [36] cli_1.1.0       
## [37] tools_3.6.1     
## [38] lazyeval_0.2.2  
## [39] tibble_2.1.3    
## [40] crayon_1.3.4    
## [41] tidyr_1.0.0     
## [42] pkgconfig_2.0.3 
## [43] zeallot_0.1.0   
## [44] MASS_7.3-51.4   
## [45] ellipsis_0.3.0  
## [46] Matrix_1.2-17   
## [47] xml2_1.2.2      
## [48] assertthat_0.2.1
## [49] rstudioapi_0.10 
## [50] iterators_1.0.12
## [51] R6_2.4.1        
## [52] compiler_3.6.1

Python session info:

import sinfo
sinfo.sinfo()

## -----
## matplotlib   3.1.1
## numpy        1.17.0
## scipy        1.1.0
## -----
## Python 3.7.2 (tags/v3.7.2:9a3ffc0492, Dec 23 2018, 23:09:28) [MSC v.1916 64 bit (AMD64)]
## Windows-10-10.0.18362-SP0
## 4 logical CPU cores, Intel64 Family 6 Model 78 Stepping 3, GenuineIntel
## -----
## Session information updated at 2020-01-12 22:47

EM imputation:
Python implementation

Junkyu Park

1. Introduction

2. Functions

Simulating `np.nan`’s

Imputing `np.nan`’s

3. Comparison with the R function

Session info

EM imputation:Python implementation

Junkyu Park

1. Introduction

2. Functions

Simulating np.nan’s

Imputing np.nan’s

3. Comparison with the R function

Session info

Related pages

EM imputation:
Python implementation

Simulating `np.nan`’s

Imputing `np.nan`’s