
Replays

Ceremony / Retrospective / Intensive Lecture / Special Invited Replays


1. Commemorative Ceremony
2. Retrospective
3. Intensive Lecture 1: Zhenhua Lin (NUS, Singapore)
4. Intensive Lecture 2: Zhenhua Lin (NUS, Singapore)
5. Special Invited Lecture: Peter Bühlmann (ETH Zürich, Switzerland)
6. President's Invited Lecture: 류근관 (Commissioner, Statistics Korea)




Special / Organized / Contributed / Student Session Replays


* To ask a question about a presented paper, click *Q&A and leave a comment.
* Papers shown as inactive are those whose authors did not consent to sharing their materials.
Oral Paper Presentations


Paper No. | Title / Authors / Affiliation | Video
Special Sessions
SS-II-1-1
Scalable and optimal Bayesian inference for sparse covariance matrices via sure screening
*이경재(성균관대), 조성일(인하대), 이재용(서울대)
Summary: In this paper, we consider a high-dimensional setting where the number of variables p can grow to infinity as the sample size n gets larger. We assume that most of the off-diagonal entries of the covariance matrix are zero. Several Bayesian methods for sparse covariance matrices have been proposed, but their computational speed is too slow, making them almost impossible to apply even to moderately high dimensions (e.g., p ≈ 200). Motivated by this, we propose a scalable Bayesian method for large sparse covariance matrices. The main strategy of the proposed method is as follows: we first safely reduce the number of effective parameters in a covariance matrix, and then impose shrinkage priors only on the selected nonzero off-diagonal entries. To this end, we suggest using sure screening, keeping only the off-diagonal entries whose absolute sample correlation coefficients are larger than a threshold and setting the rest to zero. It turns out that the proposed prior achieves the minimax or nearly minimax rate for sparse covariance matrices under the Frobenius norm. Therefore, it is not only computationally scalable but also optimal in terms of the posterior convergence rate.
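A minimal sketch of the screening step described above, assuming a plain sample-correlation threshold (the function name and threshold value are illustrative, not from the paper); shrinkage priors would then be placed only on the retained entries:

```python
import numpy as np

def screen_covariance(X, threshold=0.2):
    """Zero out covariance entries whose absolute sample correlation is small."""
    S = np.cov(X, rowvar=False)        # p x p sample covariance
    d = np.sqrt(np.diag(S))
    R = S / np.outer(d, d)             # sample correlation matrix
    keep = np.abs(R) > threshold       # screened support
    np.fill_diagonal(keep, True)       # always keep the diagonal
    return S * keep, keep              # screened estimate and its support

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 50))     # n = 100 observations, p = 50 variables
S_screened, support = screen_covariance(X)
```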
SS-II-1-2
Post-Processed Posteriors for Sparse Covariances and Its Application to Global Minimum Variance Portfolio
*이광민(U of Wisconsin), 이재용(서울대)
Summary: We consider Bayesian inference of sparse covariance matrices and propose a post-processed posterior. This method consists of two steps. In the first step, posterior samples are obtained from the conjugate inverse-Wishart posterior without considering the sparse structural assumption. The posterior samples are transformed in the second step to satisfy the sparse structural assumption through the hard-thresholding function. This non-traditional Bayesian procedure is justified by showing that the post-processed posterior attains the optimal minimax rates. We also investigate the application of the post-processed posterior to the estimation of the global minimum variance portfolio. We show that the post-processed posterior for the global minimum variance portfolio also attains the optimal minimax rate under the sparse covariance assumption. The advantages of the post-processed posterior for the global minimum variance portfolio are demonstrated by a simulation study and a real data analysis with S&P 400 data.
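A minimal sketch of the two-step procedure described above, assuming a unit-scale inverse-Wishart prior and a user-chosen threshold c (both illustrative; the paper's actual prior specification and threshold selection are not reproduced here):

```python
import numpy as np
from scipy.stats import invwishart

def post_processed_posterior(X, n_samples=1000, c=0.1):
    n, p = X.shape
    scale = np.eye(p) + X.T @ X                  # conjugate posterior scale (unit prior scale)
    draws = []
    for _ in range(n_samples):
        # Step 1: sample from the inverse-Wishart posterior, ignoring sparsity.
        Sigma = invwishart.rvs(df=p + n, scale=scale)
        # Step 2: hard-threshold the off-diagonal entries.
        mask = np.abs(Sigma) > c
        np.fill_diagonal(mask, True)
        draws.append(Sigma * mask)
    return np.array(draws)                       # post-processed posterior draws
```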
SS-II-1-3
A Scalable Partitioned Approach to Model Massive Nonstationary Non-Gaussian Spatial Datasets
Benjamin Seiyon Lee(George Mason University), *박재우(연세대)
Summary: Nonstationary non-Gaussian spatial data are common in many disciplines, including climate science, ecology, epidemiology, and social sciences. Examples include count data on disease incidence and binary satellite data on cloud mask (cloud/no-cloud). Modeling such datasets as stationary spatial processes can be unrealistic since they are collected over large heterogeneous domains (i.e., spatial behavior differs across subregions). Although several approaches have been developed for nonstationary spatial models, these have focused primarily on Gaussian responses. In addition, fitting nonstationary models for large non-Gaussian datasets is computationally prohibitive. To address these challenges, we propose a scalable algorithm for modeling such data by leveraging parallel computing in modern high-performance computing systems. We partition the spatial domain into disjoint subregions and fit locally nonstationary models using a carefully curated set of spatial basis functions. Then, we combine the local processes using a novel neighbor-based weighting scheme. Our approach scales well to massive datasets (e.g., 1 million samples) and can be implemented in nimble, a popular software environment for Bayesian hierarchical modeling. We demonstrate our method on simulated examples and two large real-world datasets pertaining to infectious diseases and remote sensing.
SS-II-2-1
Set-based rare variant association tests for biobank scale sequencing data sets
Summary: With very large sample sizes, biobanks provide an exciting opportunity to identify genetic components of complex traits. To analyze rare variants, region-based multiple-variant aggregate tests are commonly used to increase power for association tests. However, because of the substantial computational cost, existing region-based tests cannot analyze hundreds of thousands of samples while accounting for confounders such as population stratification and sample relatedness. To address this, we developed a scalable generalized mixed-model region-based association test, SAIGE-GENE, that is applicable to exome-wide and genome-wide analysis for hundreds of thousands of samples and can account for unbalanced case-control ratios for binary traits. Recently, we further improved computation time and type I error rate control and developed SAIGE-GENE+. We applied SAIGE-GENE+ to UK Biobank (UKBB) whole-exome sequencing (WES) data for 200,000 participants. In the analysis of 30 quantitative and 141 binary traits, SAIGE-GENE+ identified 551 gene-phenotype associations. In addition, we showed that incorporating multiple MAF cutoffs and functional annotations can help identify novel gene-phenotype associations. The analysis results are publicly available on a PheWeb-like web server (https://ukb-200kexome.leelabsg.org/).
SS-II-2-2
Deep nonnegative matrix factorization via variational autoencoder with application to single-cell RNA sequencing data
지동준(KAIST), Yixin Kong(Boston University), *전현호(KAIST)
Summary: Single-cell RNA sequencing technology enables the analysis of gene expression in individual cells, broadening our understanding of biological phenomena, and is widely used in numerous biomedical studies. Recently, the variational autoencoder has emerged and been adopted for analyzing single-cell data due to its high capacity for managing large-scale data. Many different variants of the variational autoencoder have been applied and have yielded excellent results. Yet, being nonlinear, the model does not give parameters that can be used to explain the underlying biology. In this paper, we propose an interpretable nonnegative matrix factorization method that decomposes parameters into those shared across cells and cell-specific ones, while achieving effective nonlinear dimension reduction via a variational autoencoder applied to the cell-specific parameters. Our model achieves nonlinear dimension reduction and estimation of cell-type-specific gene expression simultaneously. To improve the estimation accuracy, we introduce log regularization, reflecting the single-cell property. Our approach shows excellent performance in a simulation study and real data analyses; notably, this performance is achieved while maintaining biological interpretability.
SS-II-2-3
Calibration test for risk models at the tails of disease risk distribution
Summary: Risk-prediction models need careful calibration to ensure they produce unbiased estimates of risk for subjects in the underlying population given their risk-factor profiles. As subjects with extremely high or low risk may be the most affected by knowledge of their risk estimates, checking the adequacy of risk models at the extremes of risk is very important for clinical applications. We propose a new approach to test model calibration targeted toward the extremes of the disease risk distribution, where standard goodness-of-fit tests may lack power due to sparseness of data. We construct a test statistic based on model residuals summed over only those individuals who pass high and/or low risk thresholds and then maximize the test statistic over different risk thresholds. We derive an asymptotic distribution for the max-test statistic based on analytic derivation of the variance-covariance function of the underlying Gaussian process. The method is applied to a large case-control study of breast cancer to examine joint effects of common single nucleotide polymorphisms (SNPs) discovered through recent genome-wide association studies. The analysis clearly indicates a non-additive effect of the SNPs on the scale of absolute risk, but an excellent fit for the linear logistic model even at the extremes of risks.
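A minimal sketch of the max-type tail statistic described above, with a parametric bootstrap standing in for the paper's analytic Gaussian-process calibration (the names and the binomial variance formula assume a binary outcome with model-based risks p; all are illustrative):

```python
import numpy as np

def max_tail_statistic(y, p, thresholds):
    """Max over thresholds of standardized residual sums in the upper risk tail."""
    stats = []
    for t in thresholds:
        idx = p >= t
        if idx.sum() == 0:
            continue
        num = np.sum(y[idx] - p[idx])                  # summed model residuals
        den = np.sqrt(np.sum(p[idx] * (1 - p[idx])))   # binomial standard error
        stats.append(num / den)
    return np.max(np.abs(stats))

def bootstrap_pvalue(y, p, thresholds, B=2000, seed=0):
    rng = np.random.default_rng(seed)
    obs = max_tail_statistic(y, p, thresholds)
    boot = [max_tail_statistic(rng.binomial(1, p), p, thresholds) for _ in range(B)]
    return np.mean(np.array(boot) >= obs)              # bootstrap p-value
```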
SS-II-3-1
Deriving Rebalancing Strategies for Seoul's Public Bicycle System via Reinforcement Learning
Summary: This study develops a reinforcement learning-based methodology for rebalancing Seoul's public bicycles. An effective rebalancing strategy must account for diverse variables such as seasonal factors, environmental factors, floating population, traffic volume, and real-time bicycle usage patterns, together with practical considerations such as operating costs and workforce. To this end, we build a digital twin of Seoul in Unity for public bicycle rebalancing and use reinforcement learning to develop a more effective, visually verifiable rebalancing methodology. The proposed methodology is expected to alleviate the supply-demand imbalance and contribute to automating bicycle rebalancing.
SS-II-3-2
Robust Tests in Online Decision-Making
*김지수(UNIST), Jane Paik Kim(Stanford U.), 양현준(Stanford U.)
Summary: Bandit algorithms are widely used in sequential decision problems to maximize the cumulative reward. One potential application is mobile health, where the goal is to promote the user's health through personalized interventions based on user specific information acquired through wearable devices. Important considerations include the type of, and frequency with which data is collected (e.g. GPS, or continuous monitoring), as such factors can severely impact app performance and users’ adherence. In order to balance the need to collect data that is useful with the constraint of impacting app performance, one needs to be able to assess the usefulness of variables. Bandit feedback data are sequentially correlated, so traditional testing procedures developed for independent data cannot apply. Recently, a statistical testing procedure was developed for the actor-critic bandit algorithm (Lei et al., 2017). An actor-critic algorithm maintains two separate models, one for the actor, the action selection policy, and the other for the critic, the reward model. The performance of the algorithm as well as the validity of the test are guaranteed only when the critic model is correctly specified. However, misspecification is frequent in practice due to incorrect functional form or missing covariates. In this work, we propose a modified actor-critic algorithm which is robust to critic misspecification and derive a novel testing procedure for the actor parameters in this case.
SS-II-3-3
An efficient way to solve unsupervised anomaly detection problems using deep neural networks
*김동하(성신여대), 황재성(SK텔레콤), 김건웅(서울대), 김용대(서울대)
Summary: Identifying whether a given sample is an outlier or not is a significant issue in various real-world domains. Many studies have developed outlier detection methods, but they mainly presume that the training data set contains no outliers. This paper considers a more general situation in which the training data contain some outliers and no information about inliers and outliers is given. We propose a powerful and efficient learning framework to identify inliers in a training data set. Our method consists of two steps. First, we train a generative model with a modified loss of the variational autoencoder and sort the training data into two groups based on the per-sample loss. With the group labels, we then train a 2-class predictive model using a virtual adversarial training algorithm and identify inliers with their per-sample loss values. We demonstrate empirically that our method can refine inliers successfully on image and non-image data sets.
SS-II-4-1
Additive Functional Regression for Densities as Responses
*한경희(U of Illinois-Chicago), Hans-Georg Müller(University of California), 박병욱(서울대)
Summary: We propose and investigate an additive functional regression model for situations where the responses are random distributions that can be viewed as random densities and the predictors are vectors. Data in the form of samples of densities or distributions are increasingly encountered in statistical analysis and there is a need for flexible regression models that accommodate random densities as responses. Such models are of special interest for multivariate continuous predictors, where unrestricted nonparametric regression approaches are subject to the curse of dimensionality. Additive models can be expected to maintain one-dimensional rates of convergence while permitting a substantial degree of flexibility. This motivates the development of additive regression models for situations where multivariate continuous predictors are coupled with random density responses. To overcome the problem that distributions do not form a vector space, we utilize a class of transformations that map densities to unrestricted square integrable functions and then deploy an additive functional regression model to fit the responses in the unrestricted space, finally transforming back to density space. We implement the proposed additive model with an extended version of smooth backfitting and establish the consistency of this approach, including rates of convergence.
SS-II-4-2
Small Sphere Distributions and Related Topics in Directional Statistics
Summary: The research is motivated by advancing statistical shape analysis to understand the variation of shape changes in 3D objects. The first part studies a parametric approach for multivariate directional data lying on a product of spheres. Two kinds of concentric unimodal small-subsphere distributions are introduced. The first kind coincides with a special case of the Fisher-Bingham distribution; the second is a novel adaptation that independently models horizontal and vertical variations. In its multi-subsphere version, the second kind allows for correlation of horizontal variations over different subspheres. Working as models to fit the major modes of variation, the proposed distributions properly describe shape changes of skeletally-represented 3D objects due to rotation, twisting, and bending. In particular, the multi-subsphere version of the second kind accounts for the underlying horizontal dependence appropriately. The second part proposes a hypothesis test that is applicable to the analysis of principal nested spheres (PNS). In PNS, determining which subsphere to fit, between the geodesic (great) subsphere and the non-geodesic (small) subsphere, is an important issue, and it is preferred to fit a great subsphere when there is no major direction of variation in the directional data. The proposed test utilizes the measure of multivariate kurtosis. The change of the multivariate kurtosis for rotationally symmetric distributions is investigated based on modality. The test statistic is developed by modifying the sample kurtosis, and its asymptotic sampling distribution is also investigated. The proposed test is seen to work well in numerical studies with various data situations.
SS-II-4-3
Statistical Inference in Topological Data Analysis
신재혁(Carnegie Mellon University), *김지수(Inria Saclay), Alessandro Rinaldo(Inria Saclay), Larry Wasserman(Inria Saclay)
Summary: Topological Data Analysis (TDA) broadly refers to methods that extract topological features from data. A representative tool is persistent homology, which observes the data at multiple resolutions and extracts the topological features that persist across them. The persistent homology computed from data carries error due to the randomness of the data distribution, and this error can be statistically quantified. This talk briefly introduces persistent homology and how to perform statistical inference for it. Specifically, the target of estimation is the persistent homology of the upper level sets of the probability density function of the data distribution. As an estimator, we use the persistent homology of the upper level sets of a kernel density estimator. We then present how to compute a bootstrap confidence band for selecting the significant topological features in the estimated persistent homology. We first show how to compute the confidence band when the persistent homology is computed on a grid. However, grid-based computation becomes practically infeasible when the data are high-dimensional or the topological features differ in scale. Accordingly, a Vietoris-Rips complex can be used instead of a grid, and we present how to compute a statistically valid bootstrap confidence band in this case.
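A minimal sketch of the grid-based bootstrap band described in the talk, for a one-dimensional sample: the band half-width is the bootstrap (1 - α) quantile of the sup-norm distance between the KDE and its bootstrap replicates. Computing the persistent homology of the upper level sets themselves would require a TDA library such as GUDHI and is omitted here; all names are illustrative.

```python
import numpy as np
from scipy.stats import gaussian_kde

def kde_confidence_band(X, grid, alpha=0.05, B=500, seed=0):
    rng = np.random.default_rng(seed)
    n = len(X)
    f_hat = gaussian_kde(X)(grid)                       # KDE evaluated on the grid
    sups = []
    for _ in range(B):
        Xb = X[rng.integers(0, n, size=n)]              # bootstrap resample
        sups.append(np.max(np.abs(gaussian_kde(Xb)(grid) - f_hat)))
    return f_hat, np.quantile(sups, 1 - alpha)          # estimate, band half-width

X = np.random.default_rng(1).normal(size=300)
f_hat, q = kde_confidence_band(X, np.linspace(-4, 4, 200))
# Topological features of the KDE's upper level sets with persistence
# above 2 * q would be declared statistically significant.
```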
SS-II-5-1
Synthetic Data Generation Using Nonparametric Bayesian Models
Summary: This talk introduces methodology and a real-world case of synthetic data generation using nonparametric Bayesian models. Over the past decade or so, nonparametric Bayesian models have seen active theoretical development and application, and a growing body of work has shown their utility as models for nonresponse and for synthetic data generation. We give an overview of the methodology and, as a real-world case, introduce the U.S. Economic Census synthetic data project conducted at the U.S. Census Bureau, and briefly discuss the outlook for generating and using synthetic data.
SS-II-5-2
An Introduction to Deep Learning-Based Table Synthesis
Summary: Deep learning-based table synthesis can be used to protect the privacy of an original table or to augment a table with insufficient data. Many deep learning models have been proposed to date, and some show quite usable synthesis performance on benchmark data. This talk introduces the key techniques among the proposed models and closely analyzes the strengths and weaknesses of each. Since most models are designed using generative adversarial networks, that technique is also introduced, and because some models are built on variational autoencoders and related methods, those are covered broadly as well. In addition, to examine whether existing techniques maintain their synthesis performance in general settings beyond benchmarks, we present the results of a large-scale evaluation on more than 100 frequently downloaded tables from the UCI Machine Learning Repository. These tables have different characteristics from the benchmark tables and are much harder to synthesize. The analysis shows that many techniques known to work well in fact often exhibit poor synthesis performance, and we analyze the characteristics of the tables where this happens. Finally, building on these large-scale experiments and model analyses, we conclude by suggesting directions for future deep learning-based table synthesis.
SS-III-1-1
Transformed Function-on-Scalar Regression for Random Distributions
*양호진(부산대), 정상훈(부산대), 안미혜(U. of Nevada Las Vegas)
Summary: The aim of this paper is to develop a transformed function-on-scalar regression model, using functional principal components to account for random distributions. This framework allows us to model functions transformed from random distributions by using the functional principal components approach in a transformed functional space, and then to regress functional principal component scores on multiple sets of predictors in their projected space. Thereby, we can estimate the underlying model parameters as well as the effect of the covariates in the projected space. These parameters are then transformed back to the original distributional space to understand the subject-specific random distributions. We also conduct hypothesis testing and predict random distributions for any given predictors.
SS-III-1-2
On estimation and selection for semiparametric models in meta-analysis
Summary: Combining large-scale datasets of multiple studies is a valuable approach to fully utilizing the collected data. However, such studies often have privacy policies or data transfer issues that prevent individual-level data sharing. Meta-analysis combines large-scale datasets using compressed information in summary statistics without requiring individual-level data. We develop general likelihood theory on meta-analysis with semiparametric models. The theoretical framework embraces meta-analysis of studies with different observation schemes that generate various data types. We propose a method of meta-estimation and selection based on summary statistics. The resulting estimator has desirable asymptotic properties under mild assumptions. The superior performance and practical utility of the proposed method are demonstrated through numerical studies.
SS-III-1-3
Capturing network and dynamic effects in bike sharing system
*최연진(서울시립대), 손혜림(서울시립대), 조해란(U. of Bristol)
Summary: Given a dataset with network structures, one of the common research interests is to model nodal features accounting for network effects. In this study, we investigate shared-bike data in Seoul under a spatial network framework, focusing on the rental counts of each station. Our proposed method models rental counts via a generalized linear model with regularization. The regularization uses a fused lasso penalty devised to capture the network effect. In this model, parameters are specified in a station-specific manner, and the fused lasso penalty terms are applied to the parameters associated with locationally nearby stations. This approach encourages parameters corresponding to neighboring stations to take the same value and accounts for the underlying network effect in a data-adaptive way. The proposed method shows promising results.
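A schematic form of the penalized objective described above (the notation is illustrative, not taken from the paper): with station-specific coefficient vectors β_j and the edge set E of locationally nearby station pairs,

```latex
\hat{\beta} = \operatorname*{arg\,min}_{\beta}
  \; -\ell(\beta)
  \; + \; \lambda \sum_{(j,k) \in E} \lVert \beta_j - \beta_k \rVert_1 ,
```

where ℓ is the GLM log-likelihood of the rental counts. The fused penalty pulls neighboring stations' coefficients toward common values, which is how the network effect is captured data-adaptively.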
SS-III-2-1
Causal Inference for Overlapping Matched Samples
*이권상(서울대), Jose Zubizarreta(Harvard Medical School)
Summary: In recent years, there have been many advancements in matching methods. Depending on the goal of each matching scheme, different objective functions are considered, and thus different matched pairs can be obtained. These matched pairs are valid but may produce different results during analysis. We propose a new framework to make simultaneous inference for several matched samples that may overlap. We also develop a sensitivity analysis method based on randomization inference that can provide conclusions robust to the choice of matching methods. The proposed method is applied to the study of the effect of the 2010 Chilean earthquake on student achievement.
SS-III-2-2
Analysis of regression discontinuity designs using censored data
*조영주(건국대), Chen Hu(Johns Hopkins U.), Debashis Ghosh(U. of Colorado Anschutz Medical Campus)
Summary: In many medical studies, the choice of treatment may be determined by a covariate threshold. In these cases, the causal treatment effect is often of great interest, especially when there is a lack of evidence from randomized clinical trials. A class of methods known as regression discontinuity (RD) designs can be used to estimate the treatment effect in this situation. Under certain assumptions, such an estimand enjoys a causal interpretation. We show how to estimate causal effects under the regression discontinuity design for censored data. We illustrate the proposed method by evaluating the causal effect of prostate-specific antigen (PSA)-dependent screening strategies.
SS-III-2-3
Semi-Parametric Contextual Bandits with Graph-Laplacian Regularization
*최영근(숙명여대), 김지수(UNIST), 백승훈(U. of California, Berkeley), Myunghee Cho Paik(서울대)
Summary: Non-stationarity is ubiquitous in human behavior, and addressing it in contextual bandits is challenging. Several works have addressed the problem by investigating semi-parametric contextual bandits and warned that ignoring non-stationarity could harm performance. Another prevalent human behavior is social interaction, which has become available in the form of a social network or graph structure. As a result, graph-based contextual bandits have received much attention. In this paper, we propose a novel contextual Thompson-sampling algorithm for a graph-based semi-parametric reward model. Our algorithm is the first to be proposed in this setting. We derive an upper bound on the cumulative regret that can be expressed as a multiple of a factor depending on the graph structure and the regret order of the semi-parametric model without a graph. We evaluate the proposed and existing algorithms via simulation and a real data example.
SS-III-3-1
Extensive networks would eliminate the demand for pricing formulas
전재기(서울대), 박경훈(The Chinese U. of Hong Kong), *허정규(전남대)
Summary: In this study, we generate a large number of implied volatilities for the Stochastic Alpha Beta Rho (SABR) model using a graphics processing unit (GPU) based simulation and enable an extensive neural network to learn them. This model does not have any exact pricing formulas for vanilla options, and neural networks have an outstanding ability to approximate various functions. Surprisingly, the network reduces the simulation noises by itself, thereby achieving as much accuracy as the Monte-Carlo simulation. Extremely high accuracy cannot be attained via existing approximate formulas. Moreover, the network is as efficient as the approaches based on the formulas. When evaluating based on high accuracy and efficiency, extensive networks can eliminate the necessity of the pricing formulas for the SABR model. Another significant contribution is that a novel method is proposed to examine the errors based on nonlinear regression. This approach is easily extendable to other pricing models for which it is hard to induce analytic formulas.
SS-III-3-2
Detecting voice spoofing attacks using residual network, max feature map, and depthwise separable convolution
*곽일엽(중앙대), 곽성수(삼성연구원), 이준희(삼성연구원), 양종훈(중앙대), 허준호(삼성연구원), 이종훈(삼성연구원), 윤지원(고려대)
Summary: The “2019 Automatic Speaker Verification Spoofing and Countermeasures Challenge” aimed to facilitate the design of highly accurate voice spoofing attack detection systems. The competition did not emphasize model complexity and latency requirements, but such constraints are strict and integral in real-world deployment. Hence, most of the top-performing solutions from the competition used an ensemble approach and combined multiple complex deep learning models to maximize the detection accuracy, an approach that struggles with real-world deployment constraints. To design a lightweight system, we combined the skip connection (from ResNet) and the max feature map (from Light CNN) and evaluated the accuracy of the system using the ASVspoof 2019 dataset. With an optimized constant Q transform feature, our single model achieved a replay attack detection EER of 0.30% on the evaluation set, outperforming the top ensemble system in the competition, which achieved an EER of 0.39%. To optimize model size, we experimented with depthwise separable convolutions (from MobileNet), reducing the number of parameters to 15.7% of the original (from 286K to 45K) with only a slight performance drop (EER of 0.36%). Further, we applied Grad-CAM to better explain which regions of spectrograms contribute significantly to the detection of spoofed samples.
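A minimal sketch of the max feature map (MFM) activation combined with a skip connection, rendered generically in PyTorch (this is not the authors' released code): MFM splits the channel dimension in half and takes the element-wise maximum, halving the channel count.

```python
import torch
import torch.nn as nn

class MaxFeatureMap(nn.Module):
    def forward(self, x):                     # x: (N, 2C, H, W)
        a, b = torch.chunk(x, 2, dim=1)       # split channels into two halves
        return torch.max(a, b)                # element-wise max -> (N, C, H, W)

class MFMResidualBlock(nn.Module):
    def __init__(self, c):
        super().__init__()
        self.conv1 = nn.Conv2d(c, 2 * c, 3, padding=1)
        self.conv2 = nn.Conv2d(c, 2 * c, 3, padding=1)
        self.mfm = MaxFeatureMap()

    def forward(self, x):
        h = self.mfm(self.conv1(x))
        h = self.mfm(self.conv2(h))
        return x + h                          # skip connection (from ResNet)

out = MFMResidualBlock(32)(torch.randn(8, 32, 60, 100))  # -> (8, 32, 60, 100)
```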
SS-III-3-3
Multiple instance neural networks based on sparse attention for cancer detection using T-cell receptor sequences
*김영훈(성신여대), 박성오(성신여대)
Summary: Early detection of cancer is essential to increase the survival rate of cancer patients, and recently, cancer diagnosis using T-cell receptors (TCRs) has been widely studied. TCRs bind to certain antigens found on cancer cells, so we can identify cancer patients by analyzing their T cells. Multiple instance learning methods classify patients using the information of multiple receptors in T cells. In this study, we propose a multiple instance neural network with sparse attention to enhance the performance of cancer detection and its explainability. In a real case study, we verified that our proposed method outperforms previous approaches.
SS-III-4-1
Global wind modeling with transformed Gaussian processes
Summary: Uncertainty quantification of wind energy potential from climate models can be limited because it requires considerable computational resources and is time-consuming. We propose a stochastic generator that aims at reproducing the data-generating mechanism of climate ensembles for global annual, monthly, and daily wind data. Inference based on a multi-step conditional likelihood approach is achieved by balancing memory storage and distributed computation for a large data set. Finally, we discuss a general framework for modeling non-Gaussian multivariate stochastic processes by transforming underlying multivariate Gaussian processes.
SS-III-4-2
Spatial Scan Statistics: An Overview and Recent Advances
Summary: The spatial scan statistic is one of the most popular methods for identifying local spatial clusters. It has been developed for several probability models and applied to various areas including geographic disease surveillance. In this talk, an overview of spatial scan statistics will be given and recent advances such as optimizing the maximum reported cluster size in different models will be discussed.
SS-III-4-3
A general panel break test based on the self-normalization method
*최지은(부경대), 신동완(이화여대)
Summary: We propose new break tests for parameters such as the mean, variance, and quantiles of panel data sets, in a general setup based on the self-normalization method. The self-normalization tests show much better size than existing tests, resolving their over-size problem for panels with serial dependence, cross-sectional dependence, conditional heteroscedasticity, and/or N relatively larger than T; this is demonstrated theoretically by a nuisance-parameter-free limiting null distribution and experimentally by very stable finite sample sizes. The proposed test is also implemented much more easily than existing tests in that it needs no bandwidth selection for long-run variance estimation and is computed very simply. Applications of the self-normalization test to financial stock returns and realized volatilities point more toward the absence of breaks in mean and/or variance than the existing tests, which neglect cross-sectional correlation and other features apparent in the data sets.
SS-III-5-1
On cross-covariance structure of intensity functions for multivariate Log-Gaussian Cox process models
Summary: Log-Gaussian Cox process (LGCP) models are among the most popular point process models for spatial and spatio-temporal point patterns. Their popularity is due to the fact that log-transformed stochastic intensity functions are modeled as Gaussian random fields, for which various parametric univariate and multivariate covariance models are available to describe spatial and spatio-temporal dependence. The two most commonly used covariance models are (1) the so-called linear model of coregionalization and (2) multivariate Matérn models. In this work, we study the implications of the cross-covariance structure of the stochastic intensity functions of an LGCP under each choice of covariance structure and some fundamental differences between them. We illustrate our points using spatial point pattern data on the locations of terrorist attacks by two major groups in Nigeria. This is joint work with Lingling Chen.
SS-III-5-2
Latent Space Accumulator Model for Analyzing Bipartite Networks with Its Connection Time and Its Applications to Item Response Data with Response Time
윤종현(연세대), 김현주(연세대), 전민정(U. of California), *진익훈(연세대)
Summary: A question that has recently received much attention in network data analysis is how the connection times between pairs of nodes affect network structures such as transitivity and assortativity. However, few statistical models analyze the effect of connection time on network structure, especially for bipartite networks, whose nodes are divided into two sets with connections allowed only between nodes in different sets. In this article, we propose a novel model, the latent space accumulator model, for analyzing bipartite networks with connection times for each connection type, in order to estimate the effect of connection time on network structure. To model connection times for each mutually exclusive connection type, we adopt the competing risk modeling framework commonly used in survival analysis for competing events. To identify the effect of time on network structure, we embed latent spaces, one of the common approaches in network data analysis, into the competing risk models. Our model is successfully applied to item response data with response times, which can be regarded as an example of a bipartite network.
SS-III-5-3
Bayesian Convolutional Networks-based Generalized Linear Model
*전예슬(연세대), 최석준(연세대), 장원(연세대), 전성현(연세대), 박재우(연세대)
Summary: Neural networks provide complex function approximations between inputs and a response variable for a wide variety of applications. Examples include classification for images and regression for spatially or temporally correlated data. Although neural networks can improve prediction performance compared to traditional statistical models, interpreting the impact of explanatory variables is difficult. Furthermore, uncertainty quantification for predictions and inference for model parameters are not trivial. To address these challenges, we propose a new Bayesian approach by embedding convolutional neural networks (CNN) within the generalized linear model (GLM) framework. Using features extracted by the CNN as informative covariates in the GLM, our method can improve prediction accuracy and provide interpretations of regression coefficients. We show that the posterior distributions of the model parameters asymptotically follow mixtures of normals. We apply our methods to simulated and real data examples, including non-Gaussian spatial data, brain tumor image data, and fMRI data. The algorithm is broadly applicable to correlated data and quickly provides accurate Bayesian inference.
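A minimal frequentist sketch of the CNN-into-GLM idea described above (the paper is Bayesian and derives posterior asymptotics, which this sketch does not attempt; the architecture is illustrative): features extracted by a small CNN serve as covariates in a logistic GLM.

```python
import torch
import torch.nn as nn

class CNNFeatures(nn.Module):
    """Extract k covariates from an image for use in a GLM."""
    def __init__(self, k=16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4), nn.Flatten(),
            nn.Linear(8 * 4 * 4, k))

    def forward(self, x):
        return self.net(x)

cnn = CNNFeatures(k=16)
glm = nn.Linear(16, 1)                  # GLM linear predictor on CNN features
x = torch.randn(32, 1, 28, 28)          # batch of images
logits = glm(cnn(x))                    # Bernoulli GLM with logit link
prob = torch.sigmoid(logits)
```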
SS-IV-1-1
Classification accuracy as a proxy for two-sample testing
*김일문(연세대), Aaditya Ramdas(Carnegie Mellon U.), Aarti Singh(Carnegie Mellon U.), Larry Wasserman(Carnegie Mellon U.)
Summary: When data analysts train a classifier and check if its accuracy is significantly different from one half, they are implicitly performing a two-sample test. We investigate the statistical optimality of this indirect but flexible method in the high-dimensional setting. We provide a concrete answer for the case of distinguishing Gaussians with mean difference and common covariance, by contrasting the indirect approach using variants of linear discriminant analysis, such as naive Bayes, with the direct approach using corresponding variants of Hotelling's test. Somewhat surprisingly, the indirect approach achieves the same power as the direct approach in terms of the parameters of interest and is only worse by a constant factor. Other results of independent interest are sprinkled along the way, like minimax lower bounds and the optimality of Hotelling's test when the dimension grows slower than the sample size. Simulation results validate our theory, and we present practical takeaway messages along with a slew of open problems.
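A minimal sketch of the indirect test described above: pool the two samples with 0/1 labels, train a classifier on one half, and test whether its held-out accuracy differs from one half with an exact binomial test.

```python
import numpy as np
from scipy.stats import binomtest
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (200, 20)),   # sample 1
               rng.normal(0.3, 1.0, (200, 20))])  # sample 2 (shifted mean)
y = np.repeat([0, 1], 200)                        # sample membership labels

Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.5, random_state=0)
acc = LinearDiscriminantAnalysis().fit(Xtr, ytr).score(Xte, yte)
n_correct = int(round(acc * len(yte)))
print(binomtest(n_correct, len(yte), p=0.5).pvalue)  # two-sample test p-value
```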
SS-IV-1-2
EXoN: EXplainable encoder Network
안승환(서울시립대), 최호식(서울시립대), *전종준(서울시립대)
Summary: This paper proposes a new semi-supervised learning method for the VAE (Variational AutoEncoder) that can produce an explainable latent space via EXoN (EXplainable encoder Network). EXoN offers two useful advantages in implementing a VAE. First, the latent space is partitioned through the multimodality of the latent distribution, defined as a mixture distribution, and a conceptual center coordinate can be freely assigned to the mixture component corresponding to a particular label. Second, the latent subspace can easily be explored using simple statistics obtained from EXoN. We found that the cross-entropy and the Kullback-Leibler divergence play a crucial role in constructing an explainable latent space, and that the diversity of images generated from the model depends on a particular subspace called the 'activated latent subspace'.
SS-IV-1-3
A Gradient-Based Variable Selection for Binary Classification in Reproducing Kernel Hilbert Space
강종경(고려대), *신승준(고려대)
Summary: Variable selection is essential in high-dimensional data analysis. Although various variable selection methods have been developed, most rely on the linear model assumption. In this article, we propose a nonparametric variable selection method for the large-margin classifier defined on a reproducing kernel Hilbert space (RKHS). Motivated by Yang et al. (2016), we propose a gradient-based representation of the large-margin classifier and then regularize the gradient functions by the group-lasso penalty to obtain sparse gradients that naturally lead to variable selection. The groupwise-majorization-descent (GMD; Yang and Zou) algorithm is proposed to efficiently solve the problem with a large number of parameters. We employ the strong sequential rule (Tibshirani et al., 2012) to facilitate the tuning procedure. The selection consistency of the proposed method is established by obtaining a risk bound for the estimated classifier and its gradient. Finally, we demonstrate the promising performance of the proposed method through simulations and a real data illustration.
SS-IV-2-1
An Introduction to Spatio-Temporal Analysis and Modeling of COVID-19 Data
*최정순(한양대), 강다연(한양대)
Summary: Since the first confirmed cases of coronavirus disease 2019 (COVID-19) were reported in China in late 2019, case counts have grown exponentially in many countries worldwide, and on March 11, 2020 the World Health Organization (WHO) declared COVID-19 a pandemic. In Korea, since the first confirmed case on January 20, 2020, more than about 260,000 cases had occurred as of September 6, 2021. This talk introduces, through case studies, spatio-temporal analysis and modeling of domestic and international COVID-19 data. This can help us understand the spatial diffusion patterns of COVID-19 and support decision-making for containment and the evaluation of intervention policies.
SS-IV-2-2
A mobility-dependent SEIR model for assessing the effectiveness of non-pharmaceutical interventions during the COVID-19 pandemic
정승필(서울대), *이우주(서울대)
Summary: Since the first domestic COVID-19 case on January 20, 2020, local governments, the Central Disease Control Headquarters, and related public health agencies have worked to contain the COVID-19 pandemic. In particular, the government has relied on non-pharmaceutical interventions such as social distancing and bans on gatherings as its main control measures. Intuitively, social distancing and gathering bans greatly reduce population mobility, and this reduced mobility in turn lowers the frequency of confirmed cases. This talk quantitatively analyzes the relationship between the government's non-pharmaceutical interventions, mobility, and case counts through a modified SEIR model. In particular, we examine how different non-pharmaceutical interventions are reflected in the SEIR model and assess the fit of the proposed SEIR model using data from Seoul.
SS-IV-2-3
Review of genetic factors associated with COVID-19 and statistical methods
Summary: The outbreak of the 2019 novel coronavirus disease (COVID-19) started in late 2019, and in a short time it spread rapidly all over the world. Some antiviral and anti-inflammatory medications have become available, but thousands of people are dying daily. A good understanding of the SARS-CoV-2 genome is essential, and several investigations have revealed the importance of genetics in overcoming SARS-CoV-2. In this talk, the most critical findings related to the genetics of SARS-CoV-2 and the associated statistical methods will be reviewed.
SS-IV-3-1
Survey data integration with information from several sources
Summary: In the era of big data, multiple data sources are available for statistical inference with complex survey data. We consider the idea of data integration by combining an independent probability sample with a non-probability sample, e.g., census data. An area-level model approach to combining information from several sources is considered in the context of statistical inference with complex survey data. At each small area, several estimates are computed and linked through a system of structural error models. We also propose a novel approach for parameter estimation using an EM algorithm based on the approximate predictive distribution of the parameter of interest. A simulation study shows that the proposed method provides valid estimation and has better coverage rates than the direct estimator. We apply it to a small area estimation problem and to calibration estimation using labor force surveys in Korea.
SS-IV-3-2
Case Studies of Data Linkage in Public Institutions
*김영민(경북대), 임종호(연세대)
Summary: Interest in linking administrative and survey data has grown with the big data era, and the Korean government has recently been working to expand the use of public data through active disclosure. The goals of data linkage are to improve the quality and reliability of survey data by linking administrative records, and to extract the information latent in the data through various analyses and models built on the newly linked information. Moreover, with the amended Personal Information Protection Act taking effect on August 5, 2020, pseudonymized data that use no additional information may now be linked for statistical production and scientific purposes without the consent of data subjects. Internationally, the Australian Bureau of Statistics linked the 2006 and 2011 population censuses, Statistics Canada links administrative and survey data (health, education, income, and so on), and Germany makes use of various data linkages. Domestically, Statistics Korea, the Bank of Korea, the Korea Culture and Tourism Institute, and the Statistics Research Institute carry out data linkage. This paper presents case analyses based on the household debt database built by linking Statistics Korea's Population and Housing Census with Korea Credit Information data, and on the database built by linking the Korea Health Panel with National Health Insurance Service benefit records.
SS-IV-3-3
A practical guide for linking survey and administrative data
Summary: Record linkage is a popular technique for combining multiple data sources into a single data frame. This technique can be directly applied to link survey and administrative records to enhance data quality and increase the utility of the initial survey data. Many record linkage methods use EM-based statistical models, and most successful methods presume a bipartite situation in which there exists at most one true record pair between two datasets. However, current statistical methods often cannot be applied to real datasets in practice because the bipartite condition is not generally satisfied, and it is difficult to handle asymmetric costs in classification errors. In this presentation, we discuss a practical alternative approach that uses distance-based algorithms suited for multi-partite links. The proposed algorithm is applied to combine the Korea Health Panel dataset and administrative data obtained from the National Pension Service.
SS-IV-4-1
Lifting scheme for streamflow data in river networks
*박선철(충북대), 오희석(서울대)
Summary: In this presentation, we suggest a new multiscale method for analyzing water pollutant data located in river networks. The main idea of the proposed method is to adapt the conventional lifting scheme, reflecting the characteristics of streamflow data in the river network domain. Due to the complexity of the data domain structure, it is difficult to apply the lifting scheme to streamflow data directly. To solve this problem, we propose a new lifting scheme algorithm for streamflow data that incorporates flow-adaptive neighborhood selection, flow-proportional weight generation, and flow-length adaptive removal point selection. A nondecimated version of the proposed lifting scheme is also suggested. We provide a simulation study and a real data analysis of water pollutant data observed in the Geum River basin in South Korea.
SS-IV-4-2
An algorithm to compare two-dimensional footwear outsole images using maximum cliques and speeded-up robust feature
*박소영(부산대), Alicia Carriquiry(Iowa State U.)
Summary: Footwear examiners are tasked with comparing an outsole impression (Q) left at a crime scene with an impression (K) from a database or from the suspect's shoe. We propose a method for comparing two shoe outsole impressions that relies on robust features (speeded-up robust feature; SURF) on each impression and aligns them using a maximum clique (MC). After alignment, an algorithm we denote MC-COMP is used to extract additional features that are then combined into a univariate similarity score using a random forest (RF). We use a database of shoe outsole impressions that includes images from two models of athletic shoes that were purchased new and then worn by study participants for about 6 months. The shoes share class characteristics such as outsole pattern and size, and thus the comparison is challenging. We find that the RF implemented on SURF outperforms other methods recently proposed in the literature in terms of classification precision. In more realistic scenarios where crime scene impressions may be degraded and smudged, the algorithm we propose—denoted MC-COMP-SURF—shows the best classification performance by detecting unique features better than other methods. The algorithm can be implemented with the R-package shoeprintr.
SS-IV-4-3
Multiscale Methodology for Graph Signals
*최규빈(전북대), 오희석(서울대)
Summary: In this study, we develop a multiscale methodology suitable for analyzing signals defined on a network domain. The network domain can be expressed as a graph with nodes and edges, and this is the domain of the signal to be analyzed. A signal whose domain is a graph is called non-Euclidean data. In this study, we review similarity measures between nodes in the graph domain, propose a new measure, and compare the corresponding embedding techniques based on the distance matrix. In addition, we propose a multiscale analysis that decomposes the signal into various modes using the graph Laplacian and the graph Fourier transform.
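A minimal sketch of the graph Fourier transform underlying the multiscale decomposition above: eigenvectors of the graph Laplacian play the role of Fourier modes, and a node signal is decomposed by projecting onto them.

```python
import numpy as np

def graph_fourier(A, f):
    """A: symmetric adjacency matrix; f: signal on the nodes."""
    L = np.diag(A.sum(axis=1)) - A       # combinatorial graph Laplacian
    eigvals, U = np.linalg.eigh(L)       # frequencies and Fourier modes
    return eigvals, U, U.T @ f           # spectrum and GFT coefficients

A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
f = np.array([1.0, 2.0, 0.5, -1.0])
freqs, U, f_hat = graph_fourier(A, f)
f_reconstructed = U @ f_hat              # inverse GFT recovers the signal
```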
SS-IV-5-1
Group-constrained latent Dirichlet allocation for fashion item recommendation
김성휘(포항공대), 이제용(포항공대), 최준혁(포항공대), 조종현(삼성물산), 박경호(삼성물산), *채민우(포항공대)
Summary: Recent advances in machine learning have provided valuable tools for constructing various recommendation systems in e-commerce companies such as Amazon and eBay. This paper analyzes click history records from an online fashion mall using a well-known Bayesian topic model, the latent Dirichlet allocation (LDA). Although LDA has been popular in recommendation, a naive LDA-based algorithm for fashion item recommendation can suffer from a crucial issue: for a customer who clicked primarily on pants, for example, the algorithm tends to recommend only pants. Given a click history of pants, a more desirable algorithm would recommend fashion items compatible with the clicked pants, such as T-shirts, jumpers, and shoes. For this purpose, we propose an algorithm based on a novel Bayesian model, called the group-constrained LDA, which can incorporate prior information about the item groups. The proposed method is applied to the click history data from SSF Shop, one of the largest online fashion malls in South Korea.
SS-IV-5-2
Penalized logistic regression using functional connectivity as covariates with an application to mild cognitive impairment
정재환(충남대), 지성진(충남대), Hongtu Zhu(U. of North Carolina), Joseph G. Ibrahim(U. of North Carolina), Yong Fan(U. of Pennsylvania), *이은지(충남대)
Summary: We develop a pipeline to refine brain functional connectivity (FC) as proper covariates in a penalized logistic regression model and classify normal and Alzheimer’s disease (AD) susceptible groups. Three different quantification methods are proposed for FC refinement. One of the methods is dimension reduction based on common component analysis (CCA). We applied the proposed pipeline to the Alzheimer’s Disease Neuroimaging Initiative (ADNI) data and deduced pathogenic FC biomarkers associated with AD susceptibility. We demonstrated that a model using CCA performed better than others.
SS-IV-5-3
Variational Bayes method for ordinary differential equation models
*양현주(삼성SDS), 이재용(서울대)
Summary: Ordinary differential equations (ODEs) have been used in many application areas for their intuitive appeal in modeling. Despite their wide usage, the frequent absence of analytic solutions makes it challenging to estimate ODE parameters from data, especially when the model has many variables and parameters. This paper proposes a Bayesian ODE parameter estimation algorithm that is fast and accurate even for somewhat large models. The proposed method approximates an ODE model with a state-space model based on the equations of a numerical solver, which allows fast estimation by avoiding computation of a complete numerical solution in the likelihood. The posterior is obtained by a variational Bayes method, more specifically the approximate Riemannian conjugate gradient method (Honkela et al., 2010), rather than Markov chain Monte Carlo (MCMC). In comparison experiments with existing methods, the proposed method showed the best performance in reproducing the true ODE curve, with strong stability as well as the fastest computation, especially in a large model with more than 30 parameters. As a real-world application, an SIR model with time-varying parameters was fitted to COVID-19 data, and more than 50 parameters were adequately estimated for each country.
Organized Sessions
D-Ⅰ-1-1
Empirical Recommendations for Enhancing the Effectiveness of Public Big Data Use
*김종윤(NICE 평가정보), 이봉원(NICE 평가정보)
Summary: Recently, central and local governments, public institutions, and private companies have been working to collect and refine as much meaningful internal and external data as possible and to use it for policy and business decision-making. Big data-based decision-making responds to diverse regional, policy, and sectoral data demands, helps build a data ecosystem, improves public services, and contributes to solving social problems. The collection, accumulation, and analysis of official statistics will also serve as an objective basis for predicting the future and proposing alternatives. Through cases of big data-based decision-making in the public sector, whose use and perceived necessity have recently been growing, this study examines the limitations and problems and proposes policy and technical alternatives.
D-Ⅰ-1-2
Directions for Improving Data Governance to Promote Big Data Use in the Public Sector
Summary: "Data, like sunlight, will be everywhere and the basis of everything," in the words of the British weekly The Economist. In the era of digital transformation, data improves transparency and accountability not only for firms but also in public services. In the era of the 'data economy', the public sector is paying attention to the use of big data. Major countries such as the United States, the United Kingdom, and Japan are enacting legislation and competitively pursuing related institutional arrangements for public-sector big data use. In the public sector, big data is used as a major source for policy decision-making; in particular, Statistics Korea, which produces official statistics, is also attempting various methods of producing statistics using big data. This talk reviews the status and promotion cases of big data use in the public sectors of major countries and explores directions for improving the governance that supports it.
D-Ⅰ-1-3
Development of a Mobile Healthcare Counseling Algorithm
*한상태(호서대), 강현철(호서대), 연규필(호서대), 최호식(서울시립대)
Summary: The community health center mobile healthcare pilot program aims to minimize visits to health centers by overcoming temporal and spatial constraints, collecting individuals' health-behavior data in real time through a mobile app and devices, and providing personalized health management services. The first pilot in 2016 demonstrated its effectiveness through 10 health centers; accordingly, the program was expanded to 35 health centers to verify whether the first-year pilot model could be applied nationwide, laying the groundwork for future nationwide expansion. Because health center professionals must analyze each participant's activity data by domain (health, physical activity, nutrition) and provide personalized counseling comments every month, workload and staffing shortages are severe. In addition, the quality of personalized counseling comments varies widely with the disposition and competence of the professionals at each health center, so there is an urgent need to raise comment quality by building on the best counseling comments. To address this, we develop standardized domain-specific counseling algorithms by analyzing the health data accumulated since 2016 and the content of high-quality counseling comments, thereby upgrading the quality of the mobile healthcare program.
D-Ⅰ-2-1
Multi-step Double Barrier Options
이항석(성균관대), 정힘찬(Simon Fraser Univ.), *이민하(성균관대)
Summary: In this article, we study double barrier options where the upper and lower boundaries are piecewise constant functions with an arbitrary number of steps. We provide explicit formulas to price such options. Beyond their applicability via generalized formulas, it is also shown that multi-step double barrier options can be used to approximate the prices of options with double barriers of arbitrary shape. Finally, numerical studies are provided to show the validity and applicability of our theoretical findings in practice.
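A minimal Monte Carlo sketch of a knock-out call with a two-step double barrier under Black-Scholes dynamics (all parameter values are illustrative); the paper derives explicit formulas, so a simulation like this would only serve as a numerical cross-check:

```python
import numpy as np

def mc_double_barrier_call(S0=100., K=100., r=0.02, sigma=0.2, T=1.0,
                           times=(0.0, 0.5), lower=(80., 85.),
                           upper=(125., 120.), n_paths=20_000,
                           n_t=252, seed=0):
    rng = np.random.default_rng(seed)
    dt = T / n_t
    t = np.linspace(dt, T, n_t)
    z = rng.standard_normal((n_paths, n_t))
    S = np.exp(np.log(S0) + np.cumsum((r - 0.5 * sigma**2) * dt
                                      + sigma * np.sqrt(dt) * z, axis=1))
    # piecewise constant barriers: index of the step each time point is in
    step = np.clip(np.searchsorted(times, t, side="right") - 1,
                   0, len(lower) - 1)
    alive = np.all((S > np.asarray(lower)[step]) &
                   (S < np.asarray(upper)[step]), axis=1)   # never knocked out
    payoff = np.where(alive, np.maximum(S[:, -1] - K, 0.0), 0.0)
    return np.exp(-r * T) * payoff.mean()

print(mc_double_barrier_call())
```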
D-Ⅰ-2-2
Neural Credibility
*안재윤(이화여대), Rosy Oh(KAIST), Yang Lu(Concordia U.), Dan Zhu(Monash U.), 박경배(강원대)
Summary: In insurance, ratemaking based on credibility is widely used. In traditional credibility theory, premiums are set to be affine functions of the claim history, which facilitates transparency in the ratemaking process. However, the affine restriction on the premiums may lead to inefficiency, depending on the functional form of the actual forecast. Here, we propose the concept of neural credibility, where the credibility factors are modeled with a neural network. This method is interesting in that it allows intuitive interpretation while being as efficient as forecasting based on classical neural network methods. A simulation study and real data analysis are provided to show the performance of the proposed method.
D-Ⅰ-2-3
Predicting policyholders' lapse behavior in life insurance based on clustering
이항석(성균관대), 이삭(U. of Iowa), 김기춘(미래에셋생명보험), *이가은(성균관대)
Summary: Managing lapse risk is an important task, as a massive lapse can threaten a life insurer's liquidity with the loss of potential future profits. In this paper, we present a modeling approach for predicting an individual's lapse behavior with an unsupervised learning technique. The data set includes information on policies and policyholders' characteristics such as gender and age. Insurance agents' information is also considered as a factor in grouping individuals. To identify policyholders' behavior given insurance agents' information, we fit a traditional logit model combined with a clustering technique for lapse prediction. The performance of the model is tested with perplexity, which indicates the predictive accuracy of the model.
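A minimal sketch of the clustering-then-logit pipeline described above (all column names, the synthetic data, and the cluster count are illustrative): policyholders are grouped on agent-related features, and the cluster label enters a logistic regression for lapse.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age": rng.integers(20, 70, 500),
    "gender": rng.integers(0, 2, 500),
    "agent_tenure": rng.exponential(5.0, 500),   # agent-side information
    "agent_volume": rng.poisson(30, 500),
    "lapse": rng.integers(0, 2, 500),            # placeholder response
})
df["cluster"] = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(
    df[["agent_tenure", "agent_volume"]])
X = pd.get_dummies(df[["age", "gender", "cluster"]], columns=["cluster"])
fit = LogisticRegression(max_iter=1000).fit(X, df["lapse"])
lapse_prob = fit.predict_proba(X)[:, 1]          # individual lapse predictions
```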
D-Ⅰ-3-1
Proposals for Advancing Collaborative Medical Statistics Research in Korea
Summary: Master's- and doctoral-level statisticians and epidemiologists who support the medical statistics needed for clinical research work mainly at large hospitals in Korea. This talk examines, from several angles, the difficulties experienced by statisticians supporting medical statistics work and their causes, and compares them with the speaker's own experience in other countries. As a first step toward overcoming these difficulties and building more productive collaborative research that benefits both statisticians and clinical researchers, I propose creating a forum where medical statisticians in Korea can share their experiences and discuss new possibilities.
D-Ⅰ-3-2
Concordance Indexes with Left-Truncated and Right-Censored Data
Nicholas Hartman(U. of Michigan), *김세희(아산병원), Kevin He(U. of Michigan), John D. Kalbfleisch(U. of Michigan)
Summary: In the context of time-to-event analysis, a primary objective is to model the risk of experiencing a particular event in relation to a set of observed predictors. The Concordance Index (C-index) is a statistic frequently used in practice to assess how well such models discriminate between various risk levels in a population. However, the properties of conventional C-Index estimators, when applied to left-truncated time-to-event data, have not been well-studied, despite the fact that left-truncation is commonly encountered in observational studies. We first show that the limiting values of the existing C-Index estimators depend on the underlying distribution of truncation times, which can result in a misleading interpretation of model performance. We then develop a new C-Index estimator based on Inverse Probability Weighting (IPW) that corrects this limitation. The proposed IPW estimators are highly robust to the underlying truncation distribution and often numerically outperform the conventional methods.
D-Ⅰ-3-3
Nonparametric Bayesian Poisson hurdle random effects model: an application to temperature-suicide study
박진수(KAIST), 심기성(KAIST), *정연승(KAIST)
Summary: In environmental epidemiology, the short-term association between suicide and temperature has been investigated by analyzing daily time-series data collected from multiple locations. To examine between-location heterogeneity in the association, a two-stage meta-analysis has conventionally been used. However, such an approach has several limitations. First, the Poisson distribution assumption may be limited because zeros are often observed in excess. Second, the two-stage approach may not properly account for the statistical uncertainty arising in the first stage. Third, the second-stage meta-regression assumes normality, which limits flexible description of the between-location heterogeneity. This research proposes Bayesian Poisson hurdle mixed effects models to explore the heterogeneity in the temperature-suicide association. The Poisson hurdle model consists of two parts, a binary part and a positive part, giving more flexibility to deal with inflated or deflated zeros in the number of suicides. It also allows the temperature effect to be examined separately on the binary and positive parts. We include location-specific random coefficients to represent the heterogeneity in the temperature-suicide association. The random effects of both parts are modeled jointly to induce correlations and are assumed to follow a Dirichlet process mixture of normals to relax the normality assumption. For fully Bayesian inference, we implement Markov chain Monte Carlo sampling. The proposed methodology was validated through a simulation study and applied to data from a temperature-suicide association study in Japan.
D-Ⅰ-4-1
Theoretical Aspects and Future Prospects of Survey Statistics
*김규성(서울시립대), 이효정(서울시립대), 박유진(서울시립대), 장동민(서울시립대)
Summary: The concept of a 'representative sample' put forward by Kiaer (1894) in the late 19th century signaled the opening of modern survey sampling and made possible the transition from complete enumeration to sample surveys. It took some thirty years for the notion of sample representativeness to settle into that of the probability sample. The concept of consistent parameter estimation based on the sampling distribution proposed by Neyman (1934) and the HT estimator proposed by Horvitz and Thompson (HT, 1952) became stepping stones for the international standard of survey sampling after the 1930s. Rubin's (1976) concept of probabilistic nonresponse played a major role in addressing the nonresponse problem that began to grow in the 1980s, and the post-survey weighting adjustment of Särndal et al. (1993), which uses reliable external data, played a major role in stabilizing survey results. Since the 2000s, the survey environment has changed rapidly: increasing use of non-probability samples, rising nonresponse rates, rising survey costs, growth of web and other internet surveys, and increasing use of administrative data and big data; classical survey theory based on probability samples now faces challenges it cannot easily resolve. It is not yet clear whether survey statistics will keep evolving by solving the problems at hand and establish itself as a broader theoretical system, or whether, in the terms of Thomas Kuhn (1962), a new type of paradigm will replace the existing one. Given, however, that survey statistics theory has been developed and systematized in response to rising demand for surveys, we expect it to continue to be developed and systematized in one form or another.
D-Ⅰ-4-2
Past, Present, and Future of Survey Methods from the Survey Industry's Perspective: Focusing on Untact (Contactless) Surveys
Summary: The COVID-19 pandemic is rapidly changing the social and economic environment around us. Even amid a general downturn, with reduced visits to offline stores and shrinking nighttime commercial districts due to earlier homecomings, online shopping including mobile has grown; the shift from contact to 'untact' (contactless) is easy to find in daily life. In particular, an internet environment reachable anytime and anywhere via mobile devices, together with a growing reluctance toward in-person contact, is accelerating this transition. The survey industry, which directly produces and processes statistics, has likewise seen a rapid increase in various untact survey methods, centered on online surveys, since COVID-19. This talk covers (1) how survey methods have changed over time, (2) current cases of untact surveys, and (3) a look ahead to the future.
D-Ⅰ-4-3
Changes and Prospects in Official Statistics Survey Methods
Summary: The production of official statistics based on the traditional method of face-to-face surveys is under strain from many factors, including changes in social structure. Statistics Korea produces statistics not only from single administrative sources but also by combining multiple administrative data sources, and it is gradually broadening the scope of such data integration. Meanwhile, to address the worsening problems of face-to-face field surveys, contactless survey methods were introduced at regional statistics offices in five metropolitan areas and their feasibility was confirmed; a centralized contactless survey center has now been established, and the transfer of relevant surveys is under way. Finally, to create data value befitting the era of the Fourth Industrial Revolution, Statistics Korea plans to build the 'K-statistics system', a Korean official statistics system, by 2025; official statistics production methods are thus expected to keep changing in a positive direction.
D-Ⅰ-5-1
Teaching Machine Learning and Artificial Intelligence in Statistics
Summary: We examine the relationship between machine learning, artificial intelligence, and statistics, and consider effective ways to teach them. Methods such as bagging, boosting, SVM, and DNN are examined from a statistical point of view, and we discuss how statistics education can incorporate them, including the restructuring of the statistics curriculum as a whole.
D-Ⅰ-5-2
Machine Learning, Deep Learning, Reinforcement Learning, and Statistics
Summary: Traditional statistical models and AI (machine learning, deep learning, reinforcement learning) share the common structure of specifying a model and minimizing a loss function to estimate its parameters. Using conditional probability distributions and the log-likelihood, we discuss how the two fields in fact share essentially the same models and loss functions. We examine similarities and differences from the viewpoint of model structure and estimation, and discuss the achievements of the AI approach. Building on this discussion, we consider an answer to the question: what should we prepare for in the age of AI?
D-Ⅰ-5-3
Deep Learning-Based Approaches to Survival Analysis
*하일도(부경대), Hao Lin(부경대), 김지훈(부경대), 권숙희(부경대)
Summary: Recently, deep learning has delivered remarkable results in prediction and classification problems across many fields. A deep neural network model can be viewed as a generalization of existing statistical models, taking the form of a highly complex, nonlinear, nonparametric statistical model. Standard statistical models for prediction and classification (e.g., classical regression, generalized linear models) have accordingly been widely adapted to deep learning. For survival time data (time-to-event data), however, the application of deep learning has been studied relatively less because of censoring. This talk first reviews recent applications of deep learning in survival analysis and their limitations, and then presents results on deep learning approaches to the Cox proportional hazards model, which is widely used in survival analysis. Using high-dimensional real survival data, we compare the proposed method with existing machine learning survival methods (e.g., Cox-based LASSO, random survival forests) in terms of prediction error (concordance index, time-dependent Brier score, and AUC). Finally, we discuss extensions of deep learning models for survival analysis.
D-Ⅰ-6-1
A Study of Multiple Imputation Methods for the Economically Active Population Survey
*김정연(연세대), 배윤종(통계청), 박민정(한미약품)
Summary: The Economically Active Population Survey is a national survey producing employment statistics; its main purpose is to determine each person's labor force status (employed, unemployed, or not economically active). Keeping the nonresponse rate low is important for accurate statistics, and imputation can compensate for nonresponse that has already occurred. Because the questionnaire follows a sequential flow, the survey contains structural nonresponse. In addition, if even one item is missing for any household member, the whole household is treated as nonresponding, so the final data contain only unit nonresponse rather than item nonresponse. This study proposes an imputation model that is more effective than existing methods by exploiting the structure of the nonresponse and by using past data through record linkage. Model performance is evaluated in terms of concordance and discordance rates. To this end we run a simulation based on the November 2019 survey data: from 59,996 respondents, a random subset is selected and the labor force status, together with six explanatory variables crucial to determining it, is set to missing. We improve the existing imputation model by adding industry and employment-position variables, which presupposes linkage to past records, and confirm that performance improves over the existing model. We also consider various scenarios for the number of nonrespondents in each labor force status.
D-Ⅰ-6-2
Minimax bounds for estimating multivariate Gaussian location mixtures
*김경희(고려대), Adityanand Guntuboyina(U. of California)
Summary: We prove minimax bounds for estimating Gaussian location mixtures on R^d under the squared L^2 and the squared Hellinger loss functions. Under the squared L^2 loss, we prove that the minimax rate is upper and lower bounded by a constant multiple of n^{-1}(\log n)^{d/2}. Under the squared Hellinger loss, we consider two subclasses based on the behavior of the tails of the mixing measure. When the mixing measure has a sub-Gaussian tail, the minimax rate under the squared Hellinger loss is bounded from below by n^{-1}(\log n)^{d}. On the other hand, when the mixing measure is only assumed to have a bounded p^th moment for a fixed p > 0, the minimax rate under the squared Hellinger loss is bounded from below by n^{-p/(p+d)}(\log n)^{-3d/2}. These rates are minimax optimal up to logarithmic factors.
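Collected in display form, the rates stated above read as follows (a transcription of the abstract's claims, with $\varepsilon_n^2$ as our notation for the minimax risk):

```latex
\begin{aligned}
\text{squared } L^2 \text{ loss:} \quad & \varepsilon_n^2 \asymp n^{-1}(\log n)^{d/2},\\
\text{squared Hellinger, sub-Gaussian mixing measure:} \quad & \varepsilon_n^2 \gtrsim n^{-1}(\log n)^{d},\\
\text{squared Hellinger, bounded } p\text{th moment:} \quad & \varepsilon_n^2 \gtrsim n^{-p/(p+d)}(\log n)^{-3d/2}.
\end{aligned}
```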
D-Ⅰ-6-3
Tensor Canonical Correlation Analysis
Summary: Canonical correlation analysis (CCA) is a multivariate analysis technique for estimating a linear relationship between two sets of measurements. Modern acquisition technologies, for example those arising in biomedical areas such as neuroimaging, produce data in the form of multi-dimensional arrays, also known as tensors. Classic CCA is not appropriate for dealing with tensor data because of the multidimensional structure and ultra-high dimensionality of such modern data. In this paper, we propose tensor CCA (TCCA) to discover relationships between two tensors. TCCA finds a pair of loading tensors while simultaneously preserving the multi-dimensional structure of the tensors and utilizing substantially fewer parameters. Results of simulation studies will be discussed to illustrate the advantages and performance of our methods.
D-Ⅰ-7-1
A Feasibility Study of Indirect Estimation for the Agricultural Product Income Survey
*임찬수(통계청), 김경미(통계청), 권순필(통계개발원)
Summary: The Agricultural Product Income Survey is conducted every year to analyze income by crop and to provide basic data for agricultural management research and management-improvement guidance, such as management diagnosis and planning for agricultural businesses, aimed at raising farm household income. As of 2015, 58 approved crops and about 114 unapproved crops were surveyed monthly over five years from sample farms, so the response burden on farms is heavy, and operationally the survey is both costly and very difficult to run. Given the worsening survey environment and the push for survey efficiency, efforts to reduce response burden and survey cost are needed. Against this background, this study examines whether indirect estimation can be applied to the survey. Since reviewing indirect estimation for every crop is impractical, we select particular crops, propose and apply indirect estimation methods for them, and assess applicability by comparing the indirect estimates with the existing direct survey results.
D-Ⅰ-7-2
Questionnaire Design Using Cognitive Interviewing: The Case of the 'Korean Happiness Survey'
Summary: The questionnaire is the primary instrument for eliciting responses and matters because it directly affects survey accuracy and the quality of the resulting statistics. Cognitive interviewing evaluates a questionnaire from the respondent's point of view; by tracing how respondents understand the questions and produce their answers, it helps anticipate response behavior, diagnose problems in advance, reduce measurement error, and design respondent-friendly questionnaires. The newly developed 'Korean Happiness Survey' applied cognitive interviewing during questionnaire design, running two successive rounds of interviews to review the draft and the revised questionnaire. The interviews surfaced points needing review, such as questions respondents found ambiguous and confusing response criteria, and the findings fed into the construction of the questionnaire. The case offers lessons for applying cognitive interviewing in questionnaire design.
D-Ⅰ-7-3
A Pilot Analysis of Machine-Learning-Based Automated Coding for Statistical Classification
Summary: In the era of the Fourth Industrial Revolution, data science, which creates and exploits information quickly and accurately by linking big data with IT, is gaining importance, and recent methodological discussion has centered on artificial intelligence (AI). Various efforts are also under way to apply AI to official statistics. This study explores a machine learning methodology for automating statistical classification and conducts a pilot analysis of classification accuracy using statistical survey data. We learn an unsupervised index-term extraction model from the training data, obtain input vectors through an index-term-based sentence embedding model, encode the target classification codes, and train a supervised model on them. The resulting machine learning model trains quickly, classifies fast, and shows comparatively high accuracy relative to the classification results of the existing system. We expect these results to feed into future work such as building a hub system for processing unstructured data specialized to the statistics domain.
D-Ⅰ-8-1
Interval-censored rank regression
최태화(고려대), *최상범(고려대)
Summary: We propose rank-based inference procedures for the semiparametric accelerated failure time model with interval-censored data. This type of data commonly occurs in biomedical longitudinal studies, where periodic examination is inevitable but the event of interest is initially asymptomatic. The estimating equation is constructed simply by checking whether a pair of observed error terms is comparable, and it can be generalized by adopting data-dependent weights. For statistical inference, an efficient resampling procedure is considered, which is much faster than the classical bootstrap. Asymptotic properties of the proposed estimators are established with empirical processes. Furthermore, we develop a one-step efficient estimator by allowing a semiparametrically efficient weight, which depends on the unknown distribution function. To this end, EM-based nonparametric maximum likelihood estimation is also considered. The proposed methods show remarkable performance in finite-sample studies, and practical usage is illustrated with an HIV/AIDS cohort study.
D-Ⅰ-8-2
Efficient Sample Allocation by Local Adjustment for Unbalanced Ranked Set Sampling
*안수현(아주대), Xinlei Wang(Southern Methodist U.), 임요한(서울대)
Summary: In this paper, we propose procedures for adding samples so that an unbalanced ranked set sampling (RSS) design becomes more efficient than balanced RSS of equal sample size when the current design is not. We first find a sufficient set N of allocations under which unbalanced RSS is more efficient than balanced RSS for estimating the population mean, and we illustrate the set N for set size H = 3. Using N, we propose two procedures for adding samples so that the design becomes more efficient than balanced RSS of equal set and sample size. We numerically investigate their performance under various settings and apply them to re-designing an unbalanced RSS arising from missingness in a balanced RSS.
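For readers unfamiliar with RSS, the balanced baseline against which the proposed unbalanced allocations are compared can be sketched as follows; perfect judgment ranking and the simulated population are simplifying assumptions.

```python
# Balanced ranked set sampling: in each cycle, draw H sets of H units and
# keep the r-th ranked unit from the r-th set (ranking simulated as exact).
import numpy as np

def balanced_rss(population, H=3, cycles=10, rng=None):
    rng = rng or np.random.default_rng()
    sample = []
    for _ in range(cycles):
        for r in range(H):
            draw = rng.choice(population, size=H, replace=False)
            sample.append(np.sort(draw)[r])   # keep the r-th order statistic
    return np.array(sample)

pop = np.random.default_rng(1).normal(size=10_000)
print(f"RSS mean estimate: {balanced_rss(pop, H=3, cycles=30).mean():.3f}")
```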
D-Ⅰ-8-3
Mixture of Linear Models Co-Supervised by Deep Neural Networks
*서범석(한국은행), Lin Lin(펜실베니아주립대), Jia Li(펜실베니아주립대)
Summary: Deep neural networks deliver high predictive accuracy across many research fields. Nevertheless, their lack of interpretability limits their use in areas where decisions carry large consequences, such as finance, health care, and public administration, so methodology for interpreting deep neural networks is badly needed. This study presents a new mixture of linear models that is easy to interpret while retaining the high accuracy of a deep neural network: treating the fitted network as the optimal model, we approximate it piecewise linearly, which yields an efficient estimate of the mixture of linear models. For interpreting the mixture model, we provide visualization tools and a quantitative approach. Simulations and an empirical analysis show that the interpretation of the mixture model is useful and that the model attains comparatively high accuracy relative to other interpretable models. A mixture of linear models estimated by approximating a deep neural network is expected to bridge the gap between easily interpretable but less accurate linear statistical models and highly accurate but hard-to-interpret deep neural networks.
D-Ⅰ-9-1
Industrial Innovation Cases and Strategies Using Artificial Intelligence
Summary: AI algorithms and software are approaching human-level cognitive ability and are being used for complex data analysis across industries. In industry today, the role of analytics is no longer confined to model building or scoring; it contributes to sound inference and decision-making on real business problems and creates substantial value. This session introduces a four-stage analytics strategy for applying AI effectively in industry, with domestic and international cases from manufacturing, hospitals, and the health industry (Volvo, a hospital in Amsterdam, GI Vita, and others).
D-Ⅰ-9-2
A Case Study of Building an Integrated Analysis System with Colab, SAS Viya, and SAS ODA in the Cloud
Summary: This talk points out that the coupling of markdown with various statistical kernels, pioneered in the Jupyter environment, is developing anew in the Colab environment. In particular, we note that SAS's cloud platforms, ODA and Viya, combined with Colab can dramatically expand the usability of SAS, and we show concretely how to implement this technically. Finally, we demonstrate conversion between Python and SAS data sets and web crawling in Colab, illustrating the usefulness of SAS-based Open API analysis.
D-II-1-1
Causal Clustering
*Kwangho Kim(Harvard U.), Jisu Kim(Inria Saclay), Larry A. Wasserman(Carnegie Mellon U.)
Summary: We develop Causal Clustering, a new set of methods for exploring treatment effect heterogeneity that leverages tools from clustering analysis. We develop an efficient way to uncover subgroup structure in conditional treatment effects by harnessing widely-used clustering methods. We show that k-means, density-based, and hierarchical clustering algorithms can be successfully adopted into our framework via plug-in estimators, and give rates of convergence showing the additional cost of estimating nuisance outcome regressions. Further, for k-means causal clustering, we develop a specially bias-corrected estimator based on nonparametric efficiency theory, which attains fast convergence rates to the true cluster centers under weak nonparametric conditions on nuisance function estimation. This requires novel techniques due to the non-smoothness of the minimizer of the k-means risk. We also give conditions for asymptotic normality of the cluster centers. Our work leads to novel tools that are especially useful for modern outcome-wide studies with many treatment levels. We illustrate the methods via simulation studies and real data analyses.
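A minimal plug-in sketch of the idea, assuming a randomized treatment and off-the-shelf estimators; the paper's bias-corrected k-means estimator is not reproduced here.

```python
# Plug-in causal clustering: estimate nuisance outcome regressions per arm,
# form conditional treatment-effect estimates, then k-means cluster them.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
n, p = 2000, 5
X = rng.normal(size=(n, p))
A = rng.binomial(1, 0.5, size=n)                  # randomized treatment
tau = np.where(X[:, 0] > 0, 2.0, -1.0)            # two latent effect subgroups
Y = X[:, 1] + A * tau + rng.normal(size=n)

mu1 = RandomForestRegressor(random_state=0).fit(X[A == 1], Y[A == 1])
mu0 = RandomForestRegressor(random_state=0).fit(X[A == 0], Y[A == 0])
cate_hat = mu1.predict(X) - mu0.predict(X)        # plug-in CATE estimates

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(cate_hat.reshape(-1, 1))
print("estimated cluster centers:", np.sort(km.cluster_centers_.ravel()))
```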
D-II-1-2
Fast and flexible estimation of effective migration surfaces
Joseph H. Marcus(U. of Chicago), *Wooseok Ha(UC Berkeley), Rina Foygel Barber(U. of Chicago), John Novembre(U. of Chicago)
Summary: An important feature in spatial population genetic data is often “isolation-by-distance,” where genetic differentiation tends to increase as individuals become more geographically distant. Recently, Petkova et al. (2016) developed a statistical method called Estimating Effective Migration Surfaces (EEMS) for visualizing spatially heterogeneous isolation-by-distance on a geographic map. While EEMS is a powerful tool for depicting spatial population structure, it can suffer from slow runtimes. Here we develop a related method called Fast Estimation of Effective Migration Surfaces (FEEMS). FEEMS uses a Gaussian Markov Random Field in a penalized likelihood framework that allows for efficient optimization and output of effective migration surfaces. Further, the efficient optimization facilitates the inference of migration parameters per edge in the graph, rather than per node (as in EEMS). When tested with coalescent simulations, FEEMS accurately recovers effective migration surfaces with complex gene-flow histories, including those with anisotropy. Applications of FEEMS to population genetic data from North American gray wolves show it to perform comparably to EEMS, but with solutions obtained orders of magnitude faster. Overall, FEEMS expands the ability of users to quickly visualize and interpret spatial structure in their data.
D-II-1-3
Robust and Scalable Gaussian Mixture Models via Kernel Embedding
*유기성 (Yale U.)
Summary: The Gaussian mixture model (GMM) is one of the most popular methods for density estimation and model-based clustering. A common choice for maximizing the likelihood is the expectation-maximization (EM) algorithm, whose convergence to local optima is guaranteed. However, the standard EM is known to lack computational efficiency for large-scale data. In the spirit of the divide-and-conquer strategy, we propose a two-stage estimation procedure using the geometric median of subset estimates under the metric structure induced by the kernel embedding of Gaussian measures.
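A minimal divide-and-conquer sketch in the spirit of the abstract; aggregating subset estimates in Euclidean space via the Weiszfeld iteration is a simplifying assumption, whereas the paper works in the metric induced by the kernel embedding of Gaussian measures.

```python
# Stage 1: fit a GMM on disjoint subsets. Stage 2: combine subset estimates
# robustly through the geometric median (Weiszfeld iteration).
import numpy as np
from sklearn.mixture import GaussianMixture

def geometric_median(points, n_iter=100, eps=1e-8):
    z = points.mean(axis=0)
    for _ in range(n_iter):
        d = np.linalg.norm(points - z, axis=1)
        w = 1.0 / np.maximum(d, eps)
        z = (w[:, None] * points).sum(axis=0) / w.sum()
    return z

rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(-3, 1, (5000, 1)), rng.normal(3, 1, (5000, 1))])
rng.shuffle(X)

subset_means = []
for chunk in np.array_split(X, 10):
    gm = GaussianMixture(n_components=2, random_state=0).fit(chunk)
    subset_means.append(np.sort(gm.means_.ravel()))   # align components by order

print("combined component means:", geometric_median(np.array(subset_means)))
```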
D-II-2-1
CWS: Nurturing and Empowering Women in Statistics
Summary: The Caucus for Women in Statistics (CWS), formed in 1971, is an international, professional statistical society advocating for the education, employment, and advancement of women in statistics. Its mission is to advance the careers of women statisticians through advocacy, providing resources and learning opportunities, increasing their professional participation and visibility, and promoting and assessing research that impacts women statisticians. Its vision is a world where women in the profession of statistics have equal opportunity and access to influence policies and decisions in workplaces, governments, and communities. Its membership is open to anyone who supports CWS’s mission and vision, from academia, industry, government, and other entities. In this talk, I will provide a brief history of the 50 years of CWS, focusing on both accomplishments and challenges. I will also present a “call to action” for all to get involved in helping to build a strong pipeline for women statisticians.
D-II-2-2
Bayesian approach to the multiple mixed outcomes regression: Considering multivariate component-based direct and indirect effects
최지예(York U.), *경민정 (덕성여대), 박주현(동국대)
Summary: Mediation analysis has gained much attention as a way to investigate intermediate variables in the relationship between independent and dependent variables. Methods for estimating mediation effects are available for a continuous outcome and a continuous mediator related via a linear model, while for a categorical outcome or categorical mediator, methods are usually limited to two-stage mediation. We propose a Bayesian methodology for a component-based model that accounts for unstructured residual covariances while regressing multivariate mixed outcomes on pre-defined sets of predictors and multivariate component-based mediators. The multivariate ordinal outcomes are re-expressed through a set of latent continuous variables based on an approximate multivariate t-distribution. The proposed method is applied to a subset of data extracted from the 2012 National Survey on Drug Use and Health to investigate risk factors of nicotine (cigarette), alcohol, pain reliever, and marijuana dependence.
D-II-2-3
A Statistical Method for Comparing ROC Curves of Multireaders with Standalone Artificial Intelligence
*한경화 (연세대), 김성원(연세대), 최병욱(연세대), 정인경(연세대)
Summary: Multireader multicase (MRMC) ROC curve analysis is used to analyze the diagnostic performance of computer-aided detection and diagnosis systems. The most common design for an MRMC study is the fully-crossed design. When comparing a stand-alone system with multiple readings from human readers, however, the corresponding statistical methods are not well described, although several open-source packages provide functions to analyze such data. This study describes a bootstrap approach for comparing the diagnostic performance of a stand-alone artificial intelligence (AI) system with readings from multiple readers.
D-III-1-1
Robust Probit Linear Mixed Models for Longitudinal Binary Data
Kuo-Jung Lee(National Cheng Kung U.), 김찬민(성균관대), Ray-Bing Chen(National Cheng Kung U.), *이근백(성균관대)
Summary: This paper describes Bayesian methods for longitudinal studies of binary outcomes that involve repeated measurements on subjects over time with dropout. We consider probit models with random effects to capture heterogeneity and serial dependence. In this framework, we model the correlation matrix for the serial correlations of repeated responses using the hypersphere decomposition. We also consider robustness to model misspecification for probit models. An MCMC algorithm for parameter estimation of the proposed models is presented, and simulations are performed to compare against other models and to investigate the effectiveness of the prior distributions. The simulation studies also show that the proposed approach can yield improved efficiency in estimating the regression parameters. Two real examples and a CRAN R package, BayesRGMM, are provided to demonstrate the proposed approach.
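The hypersphere decomposition mentioned above parameterizes a correlation matrix by angles; a minimal sketch (with arbitrary angle values) follows.

```python
# Hypersphere decomposition: angles omega in (0, pi) define a lower-triangular
# B with unit-norm rows, so R = B B^T is a valid correlation matrix.
import numpy as np

def corr_from_angles(omega):
    """omega: (d, d) array; only the strictly lower triangle is used."""
    d = omega.shape[0]
    B = np.zeros((d, d))
    B[0, 0] = 1.0
    for i in range(1, d):
        prod_sin = 1.0
        for j in range(i):
            B[i, j] = np.cos(omega[i, j]) * prod_sin
            prod_sin *= np.sin(omega[i, j])
        B[i, i] = prod_sin                      # rows of B have unit norm
    return B @ B.T

R = corr_from_angles(np.full((4, 4), np.pi / 3))
print(np.round(R, 3))                           # unit diagonal by construction
print(np.all(np.linalg.eigvalsh(R) > 0))        # positive definite
```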
D-III-1-2
Lévy adaptive B-spline regression via overcomplete systems
*박세원(삼성SDS), 오희석(서울대), 이재용(서울대)
Summary: Estimating functions with varying degrees of smoothness is a challenging problem in nonparametric function estimation. In this paper we propose the LABS (Lévy Adaptive B-Spline regression) model, an extension of the LARK (Lévy Adaptive Regression Kernels) models, for estimating functions with varying degrees of smoothness. LABS is a LARK model with B-spline bases as generating kernels. The B-spline basis consists of piecewise degree-k polynomials with k−1 continuous derivatives and can systematically express functions with varying degrees of smoothness. By changing the degree of the B-spline basis, LABS can systematically adapt to the smoothness of functions, including jump discontinuities and sharp peaks. Simulation studies and real data examples show that the model captures not only smooth regions but also jumps and sharp peaks, and that it performs best in almost all examples. Finally, we provide theoretical results showing that the mean function of the LABS model belongs to certain Besov spaces determined by the degree of the B-spline basis and that the prior of the model has full support on those Besov spaces.
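As a minimal illustration of the building block that LABS places a Lévy random measure over, the sketch below fits a least-squares regression on a B-spline basis of chosen degree; the prior itself is not implemented, and the simulated data (a smooth curve plus one jump) are an illustrative assumption.

```python
# Least-squares fit on a degree-k B-spline basis.
import numpy as np
from scipy.interpolate import BSpline

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 1, 200))
y = np.sin(6 * x) + (x > 0.5) + rng.normal(0, 0.2, size=200)

k = 2                                                    # B-spline degree
t = np.r_[[0.0] * k, np.linspace(0, 1, 15), [1.0] * k]   # padded knot vector
X = BSpline.design_matrix(x, t, k).toarray()             # n x (len(t)-k-1)

coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print(f"residual SD: {np.std(y - X @ coef):.3f}")
```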
D-III-1-3
The Beta-Mixture Shrinkage Prior for Sparse Covariates with Posterior Minimax Rates
이경재(성균관대), *조성일(인하대), 이재용(서울대)
Summary: Statistical inference for sparse covariance matrices is crucial for revealing the dependence structure of large multivariate data sets, but scalable and theoretically supported Bayesian methods have been lacking. In this paper we propose the beta-mixture shrinkage prior for sparse covariance matrices, which is computationally more efficient than the spike-and-slab prior, and establish its minimax optimality in high-dimensional settings. The proposed prior consists of beta-mixture shrinkage priors for the off-diagonal entries and gamma priors for the diagonal entries. To ensure positive definiteness of the resulting covariance matrix, we further restrict the support of the prior to a subspace of positive definite matrices. We obtain the convergence rate of the induced posterior under the Frobenius norm and establish a minimax lower bound for sparse covariance matrices. The class of sparse covariance matrices used for the minimax lower bound is controlled by the number of nonzero off-diagonal elements and has more intuitive appeal than the classes that have appeared in the literature. The obtained posterior convergence rate turns out to be minimax or nearly minimax. In a simulation study, we show that the proposed method is computationally more efficient than competitors while achieving comparable performance. Advantages of the shrinkage prior are demonstrated on two real data sets.
D-IV-1-1
Joint change point analysis of factor and sparse autoregressive structures in high dimensions
*조해란(U. of Bristol), Idris Eckley(Lancaster U.), Paul Fearnhead(Lancaster U.), 맹혜영(Lancaster U.)
Summary: In this paper, we propose a novel approach to piecewise stationary modelling of high-dimensional time series. We first propose a general model that allows for both a dynamic factor structure driving pervasive cross-sectional dependence, and a sparse vector autoregressive structure embedding the network underlying the data once the factor-driven strong cross-(auto)covariance is removed. Operating under such a model, where neither structure is observable, we develop a change point detection methodology that jointly analyses the latent components of the data for multiple change points. We establish the consistency of the proposed methodology in estimating both the total number and the locations of the change points in the latent components. Numerical results demonstrate its good performance, and we justify the proposed modelling approach with real data applications.
D-IV-1-2
Robust Inference on Infinite and Growing Dimensional Regression
Abhimanyu Gupta(Essex U.), *서명환(서울대)
Summary: We develop a class of tests for a growing number of restrictions in infinite and increasing order time series models such as infinite-order autoregression, nonparametric sieve regression and multiple regression with growing dimension. Examples include the Chow test, Andrews and Ploberger (1994) type exponential tests, and testing of general linear restrictions of growing rank p. Notably, our tests introduce a new scale correction to the conventional quadratic forms that are recentered and normalized to account for diverging p. This correction accounts for a high-order long-run variance that emerges as p grows with sample size in time series regression. Furthermore, we propose a bias correction via a null-imposed bootstrap to control finite sample bias without sacrificing power unduly. A simulation study stresses the importance of robustifying testing procedures against the high-order long-run variance even when p is moderate. The tests are illustrated with an application to the oil regression in Hamilton (2003).
D-IV-1-3
Change-point Regularization Problem in Longitudinal Data Analysis
*박종희(서울대), Soichiro Yamauchi(Harvard U.)
Summary: Using regularization methods in longitudinal analysis can lead to erroneous inferential results when parameters change over time. In this paper, we propose a fully Bayesian approach to the change-point regularization problem by combining the Bayesian bridge model with a hidden Markov model for change-point detection. We apply our method to the study of the relationship between government partisanship and economic growth and of the relationship between food aid and civil war onset. In both applications, we uncover strong time-varying effects that are not addressed by the original studies.
General Sessions
N-II-1-1
The Evolution of Population Forecasting under Future Uncertainty
Summary: The term 'future uncertainty' has lately been appearing frequently in the media. It means that the future is not fixed but uncertain, for a variety of reasons. In demography, the sources of uncertainty considered (Lee (1988, 1994), Alho (1985), Dunstan and Ball (2016), UN (2010), and population institutes such as MPIDR (2007) and VID (2016)) come to eight kinds, including stochastic life events, forecast uncertainty, measurement error in the data, model and parameter uncertainty, policy events, changes in social structure, and unanticipated shocks such as COVID-19. Previous studies indicate that, for future populations subject to these uncertainties, stochastic population prediction is more reasonable than deterministic projection, whether high/medium/low variants, pessimistic/moderate/radical scenarios, or targets set by policymakers. This study reviews the literature and compares how population forecasting has evolved through scenarios, target setting, and stochastic prediction. We then present a hybrid population forecast that pairs the deterministic medium variant, which is easy to interpret and highly plausible, with percentiles expressing future uncertainty, and we describe how to implement it.
N-II-1-2
Estimating Allele Distributions among Unrelated Individuals for Kinship Identification in Forensic Science
*정수진(경희의료원), 이효정(동아에스티 개발본부), 이숭덕(서울대), 이재원(고려대)
Summary: Thanks to advances in science and technology, an individual can now be identified from a bloodstain alone. Identification is computed through a likelihood ratio into which the allele frequencies of unrelated individuals enter, so the result can be sensitive to which allele frequencies are used. Accurate estimation of the allele frequencies of unrelated Koreans is therefore required. To date, each institution has estimated allele frequencies through genetic testing of independent, unrelated subjects of the same ethnicity. Representativeness and sufficiency call for as many samples as possible, but because subject recruitment is difficult and costly, the common practice is to pool allele frequencies collected separately by institutions, much as in a meta-analysis. In that case, however, the genotypes are unknown and independence can be compromised. To resolve this problem, we propose estimating genotypes satisfying Hardy-Weinberg equilibrium (HWE) from the allele frequencies of several samples, pooling at the genotype level rather than the allele level, and then estimating the frequencies. Simulation comparisons and an application to real data demonstrate the superiority of the proposed method.
N-II-1-3
Spatially lagged covariate model with zero inflated Conway-Maxwell-Poisson distribution for the analysis of pedestrian injury counts
*김희영(고려대), 이수기(한양대)
Summary: Road safety has been a major issue in contemporary societies, with road crashes incurring major human and material costs annually worldwide. Since road transport inherently involves distances, it stands to reason that researchers would consider spatial analyses. In simple terms, spatial dependence means that events at one location are strongly influenced by events at neighbouring locations. When count data exhibit a large proportion of zeros, the probability of zero values often fails to match a standard count distribution. Zero-inflated models are a common strategy for handling the excess zeros; they assume the data are drawn from a mixture of a zero-degenerate part and a count part, with zero values coming from both components. In this paper, we use a spatially lagged covariate model with a zero-inflated Conway-Maxwell-Poisson distribution to account for the spatial autocorrelation of pedestrian-car crash counts.
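For reference, the zero-inflated Conway-Maxwell-Poisson (ZICMP) response can be written as follows; this is a standard formulation consistent with the abstract, and the notation is ours.

```latex
P(Y=0) = \pi + (1-\pi)\,\frac{1}{Z(\lambda,\nu)}, \qquad
P(Y=y) = (1-\pi)\,\frac{\lambda^{y}}{(y!)^{\nu}\,Z(\lambda,\nu)}, \quad y \ge 1,
\qquad Z(\lambda,\nu) = \sum_{s=0}^{\infty}\frac{\lambda^{s}}{(s!)^{\nu}},
```

where $\pi$ is the zero-inflation probability and $\nu$ controls over- or under-dispersion relative to the Poisson ($\nu = 1$).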
N-III-1-1
Characterization of histone modification patterns and prediction of novel promoters using functional principal component analysis
*김미정(이화여대), Shili Lin(Ohio State U.)
Summary: Characterization of distinct histone methylation and acetylation binding patterns in promoters and prediction of novel regulatory regions remain an important area of genomic research, as it is hypothesized that distinct chromatin signatures may specify unique genomic functions. However, methods proposed in the literature are either descriptive in nature or fully parametric and hence more restrictive in pattern discovery. In this article, we propose a two-step nonparametric statistical inference procedure to characterize unique histone modification patterns and apply it to analyzing the binding patterns of four histone marks, H3K4me2, H3K4me3, H3K9ac, and H4K20me1, in human B-lymphoblastoid cells. In the first step, we use a functional principal component analysis method to represent the concatenated binding patterns of these four histone marks around the transcription start sites as smooth curves. In the second step, we cluster these curves to reveal several unique classes of binding patterns. These uncovered patterns were used in turn to scan the whole genome to predict novel and alternative promoters. Our analyses show that there are three distinct promoter binding patterns of active genes. Further, 19,654 regions not within known gene promoters were found to overlap with human ESTs, CpG islands, or common SNPs, indicating their potential role in gene regulation, including as potential novel promoter regions.
N-III-1-2
Detecting the Granger causality in quantiles using the stationary vine copula models
장현아(숙명여대), 김종민(U. of Minnesota-Morris), *노호석(숙명여대)
Summary: Granger causality means that the past of one time series improves prediction of the future of another time series. The traditional Granger causality test based on the vector autoregression model has limitations in detecting nonlinear causality. Several studies have relaxed the parametric model assumptions and provided nonparametric versions of Granger causality tests. The nonparametric tests share the ability to detect nonlinear Granger causality, but all face the difficult problem of selecting smoothing parameters, which strongly affect detection performance. We therefore consider a Granger causality detection method based on semiparametric time series modeling, which overcomes the shortcomings of parametric modeling to some extent while remaining free of the smoothing-parameter selection problem of nonparametric modeling. To this end, we propose a method for detecting Granger causality in quantiles using stationary vine copula models. Our test has a computational advantage over the nonparametric tests in the bootstrap approximation of the null distribution of the test statistic. Applied to various simulated data, our test shows good performance in terms of size and power compared with the previously proposed methods. Finally, we analyze the causal relationships among cryptocurrencies.
N-III-1-3
Minimax estimation in multi-task regression under low-rank structures
*박관영(고려대), 구자용(고려대)
Summary: This study investigates the minimaxity of a multi-task nonparametric regression problem. We formulate a simultaneous function estimation problem based on information pooling across multiple experiments under a low-dimensional structure. A nonparametric reduced rank regression estimator based on the nuclear norm penalization scheme is proposed to incorporate the low-dimensional structure in the estimation process. Minimax upper and lower bounds are established under various asymptotic scenarios to examine the role of the low-rank structure in determining optimal rates of convergence. The results confirm that exploiting the low-rank structure can significantly improve the convergence rate for the simultaneous estimation of multiple functions.
Student Sessions
S-II-1-1
(Cancelled) Robust completion for partially observed functional data
*김현성(중앙대), 임예지(중앙대), 박연주(U. of Texas)
Summary: In recent years, applications have emerged that produce partially observed functional data, where each trajectory is collected over individual-specific subinterval(s) within the whole domain of interest. Robustness to atypical partially observed curves is a practical concern, especially in the dimension-reduction step through functional principal component analysis (FPCA). Existing studies implement FPCA by applying smoothing techniques to estimate the mean and covariance functions under an irregular functional data structure; however, the estimation is easily affected by outlying curves with heavy-tailed noise or spikes. In this study, we develop robust estimators of the mean and covariance functions using a bounded loss function, which yields robust functional principal components for partially observed functional data. Using the functional principal scores, we reconstruct the missing parts of trajectories and detect outliers. Numerical experiments show that our method provides stable and robust estimates when the data contain atypical curves.
S-II-1-2
Multivariate response quantile regression with sparse group Lasso penalty
*김현진(성균관대), 이은령(성균관대), 박세영(성균관대)
Summary: We consider a multivariate response quantile regression with high-dimensional covariates. Motivated by an analysis of cancer cell line encyclopedia (CCLE) data, we aim at estimating the coefficient matrix \(B(\tau)\) for \(\tau \in \Delta\) under a structured sparsity assumption, where \(B(\tau)\) is simultaneously element-wise and row-wise sparse and \(\Delta\) is an interval of quantile levels. Under this sparsity assumption, we propose a penalized composite quantile estimator based on a B-spline approximation. We prove that our estimator enjoys an oracle property. A novel information criterion is proposed for model selection. Numerical examples and the CCLE data are used to demonstrate the effectiveness of the proposed method.
S-II-1-3
Regularization paths of L1-penalized ROC Curve-Optimizing Support Vector Machines
*김형우(고려대), 손인석(Arontier), 신승준(고려대)
Summary: The receiver operating characteristic (ROC) curve is one of the most popular tools for evaluating the performance of binary classifiers in a variety of applications. Rakotomamonjy (2004) proposed the ROC-SVM, which directly optimizes the area under the ROC curve instead of the prediction accuracy. In this article, we study the L1-penalized ROC-SVM. We first show that the L1-penalized ROC-SVM has piecewise linear regularization paths and then develop an efficient algorithm to compute the entire paths, which greatly facilitates its tuning procedure.
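The objective being path-followed can be sketched as a pairwise hinge surrogate for one minus AUC plus an L1 penalty; the code below only evaluates this objective on illustrative data, whereas the paper computes the exact piecewise-linear paths.

```python
# Pairwise hinge surrogate of 1 - AUC for a linear score, plus an L1 penalty.
import numpy as np

def l1_roc_svm_objective(w, Xp, Xn, lam):
    """Xp: positive-class rows; Xn: negative-class rows; lam: L1 weight."""
    gaps = (Xp @ w)[:, None] - (Xn @ w)[None, :]   # all positive-negative gaps
    return np.maximum(0.0, 1.0 - gaps).mean() + lam * np.abs(w).sum()

rng = np.random.default_rng(0)
Xp = rng.normal(1.0, 1.0, size=(40, 5))            # positives shifted upward
Xn = rng.normal(0.0, 1.0, size=(60, 5))
w = np.ones(5) / 5
print(f"objective at w: {l1_roc_svm_objective(w, Xp, Xn, 0.1):.3f}")
```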
S-II-1-4
A Random-Intercept Hierarchical Linear Model for Multiregional Clinical Trials
*박천균(연세대), 강승호(연세대)
Summary: Multiregional clinical trial data are hierarchical: the trial comprises several regions, and the patient population is nested within each region. To reflect this structure, Kim and Kang (2020) proposed a hierarchical linear model with random slopes. The most important statistical inference in a clinical trial is the test of the overall treatment effect, so in choosing a model one must check that the type I error rate can be controlled. The model of Kim and Kang (2020), however, cannot control the empirical type I error rate at the nominal level when the number of regions is small, because the random effects are then poorly estimated. As an alternative, this paper proposes a hierarchical linear model with a random intercept. Because imprecise estimation of the random effect does not affect the test of the overall treatment effect in this model, the type I error rate is controlled at the nominal level. We also compare the strengths and weaknesses of the two models from various statistical viewpoints.
S-II-1-5
Residual flipped pseudo observations based variable selection
*신우영(고려대), 정윤서(고려대)
Summary: Variable selection via penalized models lets us distinguish significant from non-significant variables in various fields. However, penalized models may fail to recover the zero coefficients when the sample size \(n\) is small relative to the dimension \(p\). We therefore propose a new variable selection method that uses the residuals from an initially fitted penalized regression model. Our method shrinks the coefficients further than the initial fit, and through this additional shrinkage it can identify the truly significant variables. It can be adapted to L2-penalized regression models and penalized quantile regression models. Through simulation studies and real data examples, we compare the performance of our method as the ratio of \(n\) to \(p\) varies.
S-II-1-6
Merging Components in Linear Gaussian Cluster-Weighted Model
*오상곤(성균관대), 서병태(성균관대)
Summary: Cluster-weighted models (CWMs) are useful tools for finding latent functional relationships between the response and covariates and have comparable predictive power. However, because of the extra distributional assumptions on the covariates, they often suffer from misspecification, which undermines predictive power and makes the model hard to interpret. In this paper, we propose a new type of cluster-weighted model that imposes a hierarchical structure on the component distribution, yielding a more flexible yet parsimonious cluster-weighted model. The proposed method provides more interpretable clusters than existing methods and has predictive performance comparable to popular machine learning models.
S-III-1-1
Self-starting control charts for social network surveillance
*이주원(중앙대), 이재헌(중앙대)
Summary: Recently the need for network surveillance to detect abnormal behavior within dynamic social networks has increased. We consider a dynamic version of the degree corrected stochastic block model (DCSBM) to simulate dynamic social networks and to monitor for a significant structural change in these networks. To apply a control charting procedure to network surveillance, in-control model parameters must be estimated from the Phase I data, that is from historical data. In network surveillance, however, there are many situations where sufficient relevant historical data are unavailable. In this paper we propose a self-starting Shewhart control charting procedure for detecting change in the dynamic networks. This procedure can be a very useful option when we have only a few initial samples for parameter estimation. Simulation results show that the proposed procedure has good in-control performance even when the number of initial samples is very small.
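A minimal self-starting Shewhart sketch in the spirit of the abstract: each new observation is standardized by the running mean and standard deviation of all earlier ones, so no Phase I sample is needed. The monitored statistic and the 3-sigma-style limit are illustrative assumptions, not the DCSBM-based statistic of the paper.

```python
# Self-starting Shewhart chart: standardize each point by the history so far.
import numpy as np

def self_starting_shewhart(x, L=3.0, warmup=3):
    signals = []
    for t in range(warmup, len(x)):
        hist = x[:t]
        z = (x[t] - hist.mean()) / hist.std(ddof=1)
        if abs(z) > L:
            signals.append(t)                  # out-of-control signal at t
    return signals

rng = np.random.default_rng(0)
stream = np.concatenate([rng.normal(0, 1, 30), rng.normal(3, 1, 10)])  # shift at t=30
print("first signal at index:", self_starting_shewhart(stream)[:1])
```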
S-III-1-2
Variable Selection for Ultra-High Dimensional Data with Measurement error
*이하정(성균관대), 김재직(성균관대)
Summary: Owing to the rapid development of high-throughput technologies, ultra-high dimensional data are now common and play a very important role in genomic, biological, and chemical fields. One important issue with such data is how to select important variables in regression or classification problems, and many methods have been developed for it. In general, however, high-throughput equipment carries intrinsic measurement error, which can interfere with the selection of important variables and produce falsely discovered ones. To alleviate this problem, we propose an iterative variable selection method using marginal likelihood and regularization methods that account for measurement error. The proposed method reduces the number of falsely discovered variables under measurement error. Its performance is verified through simulation studies, and it is applied to gene expression data.
S-III-1-3
(Cancelled) High-Dimensional Confounding Adjustment Using Functional Data Analysis
*차상훈(경북대), 송준진(Baylor U.), 이경은(경북대)
Summary: Estimating treatment effects in observational studies suffers from bias due to confounders that disturb causal inference. Propensity score analysis (PSA) is commonly used to reduce this bias. As technology advances, densely measured variables are collected and assembled into high-dimensional data, in which estimation of the propensity score (PS) is unstable owing to multicollinearity and may ultimately yield a biased estimate of the treatment effect. To deal with this problem, this study treats high-dimensional data as functional data and applies them to PSA. In simulations and a real data analysis, we compare functional PSA with conventional PSA and find that functional PSA gives less biased estimates of the treatment effect. A bootstrap procedure is also used to adjust the variance estimator of the treatment effect.
S-III-1-4
A Robust Bayesian Concordance Correlation Coefficient for Vector Measurements
*최영태(경북대), 이두형(경북대), 이경은(경북대)
Summary: The concordance correlation coefficient (CCC) is a popular index for evaluating the agreement of two observers. Many studies have extended the CCC to multiple observers, data with replication, and data with repeated measurements. In this paper, we propose a definition of the CCC for vector measurements (CCCV), applicable when one reading by an observer includes several measurements, together with three numerical ways of estimating it. To overcome some of their drawbacks, such as sensitivity to outliers, we propose a robust Bayesian CCCV. We illustrate the performance of the proposed methods with both simulation studies and a real-life example.
S-III-1-5
An Analysis of Regional Differences in Sport Climbing Using Instagram Data
*한소율(중앙대), 박영호(한남대)
Summary: Sport climbing is the sport of ascending artificial walls in urban settings. Prior research has mostly aimed at improving climbing performance, though it has recently broadened from athlete-centered work to studies of user satisfaction with commercial climbing gyms and physiological studies of the sport's effects on the general public. Sport climbing debuted at the Tokyo 2020 Olympic Games, and its adoption as an official Olympic event has drawn growing public interest, yet research on how widespread the sport has become is still scarce. This study therefore uses real-time Instagram data to gauge the popularity of sport climbing. Instagram was chosen because the sport is practiced mainly by people in their teens through thirties, the age groups that use Instagram most, so the data should represent sport climbers well. We crawled 13,800 posts with a Python program; after removing duplicates, 9,498 posts remained for analysis. Location fields collected by the crawler were converted to latitude-longitude coordinates through the Kakao API to generate addresses, and a nonparametric comparison of means was run to test for regional differences in climbing activity. The results show no difference in activity among Seoul, Incheon/Gyeonggi, and the Gyeongsang region, but differences among the remaining regions, suggesting that sport climbing varies by region and is, in Korea, a sport concentrated in large cities. A limitation is that Instagram does not allow filtering by date, so yearly data could not be collected and year-by-region differences could not be tested.
S-III-1-6
A Tree-based Scan Statistic for Detecting Signals of Drug-Drug Interactions in Spontaneous Reporting Databases
*허석재(연세대), 정인경(연세대)
Summary: Clinical trials generally focus on the safety and efficacy of single drugs rather than the effects of drug-drug interactions (DDIs). However, concomitant use of multiple drugs can increase the risk of adverse events (AEs) due to DDIs; the proportion of AEs caused by DDIs has been estimated at around 30% of unexpected AEs. Detecting signals of AEs caused by DDIs is therefore as important in post-market drug safety surveillance as detecting signals of single-drug-induced AEs. Several statistical methodologies for DDI signal detection have been proposed, such as the Ω shrinkage measure (Norèn et al., 2008), chi-square statistics for screening AEs caused by DDIs (Gosho et al., 2017), the combination risk ratio (Susuta and Takahashi, 2014), and the concomitant signal score (Noguchi et al., 2020). These methods, however, were developed without considering the hierarchical structure of AE codes, such as the World Health Organization's Adverse Reaction Terminology. Most of the proposed methods also do not address the potential reporting biases of spontaneous reporting systems, such as under-reporting and relative over-reporting of specific drugs or AEs. In this study, we propose a DDI signal detection method based on the tree-based scan statistic, which simultaneously searches a large number of nodes in a database for nodes with relatively high risk. Under several assumptions, our proposed method can rule out the problems of potential reporting bias. We conducted simulation studies comparing the performance of the proposed method with existing methods in various settings, and we performed a real data analysis using the database of the Korea Adverse Event Reporting System.
S-IV-1-1
Social network monitoring procedure based on partitioned networks
*홍휘주(중앙대), 이주원(중앙대), 이재헌(중앙대)
Summary: As interest in social network analysis increases, researchers have also become interested in detecting changes in social networks. Changes in social networks appear as structural changes, so detecting a change in a social network means detecting a change in its structural characteristics. A local change is one that occurs in a part of the network, usually among close neighbors. The purpose of this article is to propose a procedure for efficiently detecting such local changes. To detect them more efficiently, we divide the network into sub-networks and monitor each sub-network; this lets us detect local changes more quickly and obtain information about where they occur. Simulation studies show that the proposed method is efficient when the network size is small and the amount of change is small. In addition, under a fixed overall false alarm rate, partitioning the network into smaller sub-networks and monitoring them detects local changes better.
S-IV-1-2
Kernel-based hierarchical structural component models for pathway analysis
*황보수현(서울대), 이선영(서울대), 이승연(세종대), 황흥선(McGill U.), 김인영(Virginia Polytechnic Institute and State U.), 박태성(서울대)
Summary: We propose a new approach, Hierarchical structural CoMponent analysis using Kernel (HisCoM-Kernel). The proposed method models nonlinear association between biomarkers and phenotype by extending the kernel machine regression, and analyzes entire pathways simultaneously by using the biomarker-pathway hierarchical structure. Our simulation studies and real data analyses showed its superior performance compared to existing methods in identifying more biologically meaningful pathways, including those reported in previous studies.
S-IV-1-3
Developing weighted RFS for Type 2 Diabetes
*Apio Catherin(서울대), 정원일(숭실대), 문민경(서울대), 권오란(이화여대), 박태성(서울대)
Summary: Although it is well known that diet is very important in the development of type 2 diabetes (T2D), assessing overall dietary patterns is challenging, and only a few indices for assessing dietary patterns have been proposed. The Recommended Food Score (RFS) is a relatively simple method based on a tally of the consumption frequency of food items emphasized in current dietary guidelines. The current RFS simply averages all food item scores. Since this does not reflect the characteristics of the foods well, we developed weighted RFSs using more sophisticated statistical methods, the Hierarchical Structural Component model for the analysis of Food Scores (HisCoM-RFS) and partial least squares discriminant analysis (PLSDA-RFS), to find optimal weights for each food item. First, we performed an association analysis between T2D and the RFSs with both logistic and Cox regression models. Second, a gene-diet interaction analysis focusing on SNP-RFS interactions was performed using the same models. Lastly, we stratified subjects into low, intermediate, and high genetic risk and diet quality, based on the polygenic risk score and the RFSs respectively, and repeated the association analysis with Cox regression. We applied these analyses to the Korean Genome and Epidemiology Study (KoGES Ansan-Ansung) cohort data, and pathway analysis with the significant SNPs from the interaction analysis was performed to find pathways related to T2D. These analyses shed light on the influence of diet, and of interactions between genetics and lifestyle, on the development of T2D.
S-IV-1-4
Bayesian pathway selection
*Nizeyimana Pacifique(경북대), 이경은(경북대), 김인영(Virginia Polytechnic Institute and State U.)
Summary: We propose a Bayesian pathway selection method that selects pathways (sets of genes) directly related to a continuous response variable under a nonparametric hierarchical model framework. It is motivated by the facts that sets of genes explain the response variable more efficiently than single genes and that pathways are interpretable. We combine stochastic search variable selection with kernel machines to select pathways associated with a continuous outcome after adjusting for other covariate effects. Pathways are selected simultaneously, in contrast to other methods in which pathways are analyzed separately. We present simulation studies as well as a real data application, and the results indicate that the model can successfully detect effective pathways associated with continuous outcomes.
S-IV-1-5
Low-rank and sparse decomposition in multivariate regional quantile regression
*김소현(성균관대), 박세영(성균관대)
Summary: We propose a multiple-response regional quantile regression that imposes a structural condition on the underlying coefficient matrix. This work is motivated by the analysis of the cancer cell line encyclopedia (CCLE), which consists of resistance responses to multiple drugs and gene expression of cancer cell lines. In the CCLE data analysis, we assume that only a few genes are relevant to drug resistance and that some genes may have similar effects on multiple responses. To predict the drug resistance response from gene information and to identify the genes responsible for the sensitivity of the resistance response to each drug, we propose a penalized multivariate quantile regression that decomposes the quantile coefficient function into a low-rank matrix and a sparse matrix. The low-rank part is a constant function of the quantile level and represents the global pattern of the coefficient function, whereas the sparse part can vary smoothly with the quantile level and represents local, drug-specific patterns. We compute the proposed estimator via the alternating direction method of multipliers (ADMM) algorithm. We also propose a novel tuning-parameter selection using GIC to select a parsimonious model with good predictive ability. In numerical analyses with simulated data, the proposed method predicts drug responses better than the other methods.
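In display form, the decomposition described above is (notation chosen here, not taken from the paper):

```latex
B(\tau) = L + S(\tau), \qquad \tau \in \Delta,
```

where the low-rank component $L$ is constant in the quantile level $\tau$ and the sparse component $S(\tau)$ varies smoothly with $\tau$.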
S-IV-1-6
The Effect of Rebalancing on LDA in Imbalanced Classification
김경희(고려대), *정현우(고려대)
Summary: One remedy for class imbalance is rebalancing at an optimal rate. The theoretical derivation of this rate is rarely considered; it is usually determined empirically. Using a linear discriminant classifier, we derive the theoretically optimal rate that maximizes the Matthews correlation coefficient (MCC) and the F1 score under normality. Our findings suggest that with careful consideration of the level of class imbalance and the separability between the two classes, better classification results can be achieved in the presence of class imbalance.
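A minimal sketch of how the rebalancing (oversampling) rate shifts LDA's decisions and the resulting MCC; the theoretically optimal rate derived in the paper is not reproduced, and the data are illustrative.

```python
# Oversample the minority class at several rates and track the MCC of LDA.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import matthews_corrcoef

rng = np.random.default_rng(0)
n1, n0 = 100, 1900                                    # imbalanced two-class data
X = np.vstack([rng.normal(1.5, 1, (n1, 2)), rng.normal(0, 1, (n0, 2))])
y = np.array([1] * n1 + [0] * n0)

for mult in [1, 5, 10, 19]:                           # minority size -> n1 * mult
    idx = rng.choice(np.where(y == 1)[0], size=n1 * (mult - 1), replace=True)
    Xb = np.vstack([X, X[idx]])
    yb = np.concatenate([y, y[idx]])
    clf = LinearDiscriminantAnalysis().fit(Xb, yb)
    print(f"rate {mult:>2}: MCC = {matthews_corrcoef(y, clf.predict(X)):.3f}")
```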