Alternative Methodology of Location Model for Handling Outliers and Empty Cells Problems: Winsorized Smoothed Location Model

Authors:

Hashibah Hamid

DOI NO:

https://doi.org/10.26782/jmcms.spl.4/2019.11.00010

Keywords:

Outliers, Winsorization, Non-Parametric Smoothing, Location Model Rule, Misclassification Rate

Abstract

The location model is a well-established basis for discrimination when binary and continuous variables must be handled simultaneously. The binary variables define cells, while the continuous variables carry the information that measures the difference between groups within each cell. If some of the resulting cells are empty, however, the classical location model rule is biased and sometimes infeasible. Previous studies have shown that a non-parametric smoothing approach greatly reduces the effect of empty cells. One practical drawback remains: the smoothed location model performs poorly when the data sample contains outliers. The purpose of this paper is to address these limitations of the location model in the presence of both outliers and empty cells. Accordingly, a new location model rule, called the Winsorized smoothed location model, is developed by combining Winsorization with the non-parametric smoothing approach so that outliers and empty cells are handled at once. Simulation results show the improvement offered by the new rule: misclassification rates decline dramatically, even when the data contain outliers, across all 36 simulated data settings. Findings from a real dataset, full breast cancer, also show that the newly developed Winsorized smoothed location model achieves the best performance when compared with more than 10 existing discrimination methods. These results indicate that the new rule extends the range of applicability of the location model, which previously required non-contaminated datasets to achieve tolerable performance. Overall, the investigation confirms that the new rule offers practitioners another promising methodology for discrimination tasks, as it compared very favourably with all of its competitors except one.
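As a rough illustration of the two ingredients named in the abstract, the sketch below indexes cells from binary variables and computes Winsorized means of the continuous variables within each non-empty cell. It is a minimal sketch under stated assumptions, not the paper's procedure: the function names, the 10%/90% quantile cut-offs, and the per-cell, per-variable Winsorization are all hypothetical choices, and the non-parametric smoothing step that borrows information for empty cells is not implemented.

```python
import numpy as np

def cell_index(binary_row):
    """Map a vector of 0/1 values to a cell number in 0 .. 2**q - 1."""
    return int(sum(int(b) << j for j, b in enumerate(binary_row)))

def winsorize(x, lower=0.1, upper=0.9):
    """Clamp values outside the given quantiles to those quantiles.

    The 10%/90% cut-offs are an assumption made for illustration only.
    """
    x = np.asarray(x, dtype=float)
    lo, hi = np.quantile(x, [lower, upper])
    return np.clip(x, lo, hi)

def winsorized_cell_means(binary, continuous, lower=0.1, upper=0.9):
    """Winsorized mean of each continuous variable in each non-empty cell.

    Cells with no observations are simply absent from the result; a
    smoothing step would be needed to estimate their means.
    """
    binary = np.asarray(binary)
    continuous = np.asarray(continuous, dtype=float)
    cells = np.array([cell_index(row) for row in binary])
    means = {}
    for c in np.unique(cells):
        block = continuous[cells == c]
        means[int(c)] = np.array(
            [winsorize(block[:, k], lower, upper).mean()
             for k in range(block.shape[1])]
        )
    return means

# Hypothetical data for one group: two binary variables give four cells,
# but only cells 0 and 3 are observed; cell 0 contains one gross outlier.
binary = [[0, 0], [0, 0], [0, 0], [0, 0], [0, 0], [1, 1], [1, 1]]
continuous = [[1.0], [2.0], [3.0], [4.0], [100.0], [5.0], [7.0]]
means = winsorized_cell_means(binary, continuous)
```

In this toy example the outlier in cell 0 is pulled toward the bulk of the data before the mean is taken, while cells 1 and 2 have no entry at all, which is the empty-cell situation that smoothing is meant to address.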

