Serbian Journal of Management
2013, vol. 8, iss. 1, pp. 9-24
article language: English
document type: Original Paper
published on: 20/05/2013
doi: 10.5937/sjm8-3226
Highly robust methods in data mining
Institute of Computer Science of the Academy of Sciences of the Czech Republic, Praha, Czech Republic

e-mail: kalina@cs.cas.cz

Abstract

This paper is devoted to highly robust methods for information extraction from data, with special attention paid to methods suitable for management applications. The sensitivity of available data mining methods to the presence of outlying measurements in the observed data is discussed as their major drawback. The paper proposes several new highly robust methods for data mining, which are based on the idea of implicit weighting of individual data values. In particular, it proposes a novel robust method of hierarchical cluster analysis, which is a popular data mining method of unsupervised learning. Further, a robust method for estimating parameters in logistic regression is proposed. This idea is extended to a robust multinomial logistic classification analysis. Finally, the sensitivity of neural networks to the presence of noise and outlying measurements in the data is discussed. A method for robust training of neural networks for the task of function approximation is proposed, which has the form of a robust estimator in nonlinear regression.
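
The abstract names implicit weighting as the common idea but does not spell out a weighting scheme. As a minimal sketch of what such a scheme can look like in linear regression, the Python example below assigns weights to observations according to the rank of their squared residuals, so that the worst-fitting points are downweighted automatically. The function name `rank_weighted_fit`, the trimmed linear weight function, the `trim` parameter, and the simple iterative refitting are all illustrative assumptions and are not taken from the paper.

```python
import numpy as np

def rank_weighted_fit(X, y, n_iter=20, trim=0.75):
    """Heuristic linear fit with rank-based (implicit) weights.

    Illustrative assumption: weights decrease linearly with the rank of the
    squared residual and are zero for the largest (1 - trim) share of them.
    """
    n = len(y)
    A = np.column_stack([np.ones(n), X])  # design matrix with intercept
    # Weight attached to each rank (rank 0 = smallest squared residual).
    rank_weights = np.clip(1.0 - (np.arange(n) / n) / trim, 0.0, None)
    # Start from ordinary least squares.
    beta = np.linalg.lstsq(A, y, rcond=None)[0]
    for _ in range(n_iter):
        resid_sq = (y - A @ beta) ** 2
        ranks = np.argsort(np.argsort(resid_sq))  # rank of each observation
        sqrt_w = np.sqrt(rank_weights[ranks])
        # Weighted least squares via rescaled rows.
        beta = np.linalg.lstsq(A * sqrt_w[:, None], y * sqrt_w, rcond=None)[0]
    return beta

# Usage: points on the line y = 2 + 3x, with every tenth value shifted by 50.
x = np.linspace(0.0, 10.0, 50)
y = 2.0 + 3.0 * x
y[::10] += 50.0
beta = rank_weighted_fit(x, y)  # [intercept, slope]
```

In this toy data set the shifted observations have the largest residuals under the initial least squares fit, so they receive zero weight and the refit relies only on the uncontaminated points. Starting from ordinary least squares is a convenience here; with heavier contamination a more careful initialization would likely be needed.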
