Feed SQ | 디 이코노미

Modeling Digital Advertising Data with Measurement Error: Poisson Time Series and Poisson Kalman Filter Approach

Published

2023-09-22 12:00

Jeongwoo Park*

* Swiss Institute of Artificial Intelligence, Chaltenbodenstrasse 26, 8834 Schindellegi, Schwyz, Switzerland

Abstract

This study examines the impact of measurement error, an inherent problem in digital advertising data, on predictive modeling. To do this, we simulated measurement error in digital advertising data and applied a GLM (Generalized Linear Model) based model and a Kalman Filter based model, both of which can partially mitigate the measurement error problem. The results show that measurement errors can trigger regularization effects, improving or degrading predictive accuracy depending on the data. However, we confirmed that reasonable levels of measurement error did not significantly impact our proposed models. In addition, we noted that the two models behaved differently depending on the data size, so we applied an ensemble-based stacking technique that combines the advantages of both models. For this process, we designed our objective function to apply different weights depending on the precision of the data. We confirmed that the final model produces better results than the individual models.


1. Introduction

1.1 Background

Digital advertising has exploded in popularity and has become a mainstream part of the global advertising market, offering new areas unreachable by traditional media such as TV and newspapers. In particular, as the offline market shrank during the COVID-19 pandemic, the digital advertising market gained more attention. Domestic digital marketing spend grew from KRW 4.8 trillion in 2017 to KRW 6.5 trillion in 2019 and KRW 8.0 trillion in 2022, a growth of about 67\% in five years, and accounted for 51\% of total advertising expenditure as of 2022\cite{KOBACO}.

The rise of digital advertising has been driven by the proliferation of smartphones. With the convenience of accessing the web anytime and anywhere, which is superior to PCs and tablets, new internet-based media have emerged. Notably, app-based platform services that provide customized services based on user convenience have rapidly emerged and significantly contributed to the growth of digital advertising.

Advertisers prefer digital advertising due to its immediacy and measurability. Traditional media such as TV, radio, and offline advertising make it challenging to elicit immediate reactions from consumers. At best, post-ad surveys can gauge brand recognition and the inclination to purchase the product when needed. In digital advertising, however, a call-to-action button leading to a purchase page can prompt a quick consumer response before brand recall and purchase intention fade.

In addition, in traditional advertising media, it is difficult to accurately measure the number of people exposed to an ad and the conversions it generates. In particular, because of the lag effect of traditional media mentioned above, ad performance can only be evaluated retrospectively from subsequent business performance, and that data is rife with noise. It is therefore hard to distinguish whether an incremental change in business performance is caused by advertising or by other exogenous variables. In digital advertising, on the other hand, third-party ad tracking services store user information on the web or app to track which ad a user responded to and the user's subsequent behavior. The benefits of immediacy and measurability help advertisers quickly and accurately determine the effectiveness of a particular ad and make decisions.

However, with the advent of measurability came the issue of measurement errors in the data. There are many sources of measurement error in digital ad data, such as a user responding to an ad multiple times in a short period of time, or ad fraud, which is the manipulation of ad responses for malicious financial gain. As a result, ad data providers keep revising their ad reports for up to a week in order to provide corrected data to advertisers.

1.2 Objectives

In this study, we aim to apply a model that can make reasonable predictions from data with inherent measurement errors. The analysis has two main objectives. First, we verify the impact of measurement error on the prediction model. We perform simulations for various cases, considering that this impact may vary depending on the size of the measurement error and the length of the data period. Second, we present several models that take the characteristics of the data into account and, based on these, propose a final model that can robustly predict the data.

2. Key Concepts and Methods

Endogeneity and Measurement Error

A regressor is endogenous if it is correlated with the error term of the regression model. Let $E(\epsilon_{i} | x_{i}) = \eta$ with $\eta \neq 0$. Then the OLS estimator $b$ is biased, since

$\DeclareMathOperator*{\plim}{plim}$

\begin{align}
E(b | X) = \beta + (X'X)^{-1}X'\eta \neq \beta
\end{align}

So the Gauss-Markov Theorem no longer holds. Also, the estimator is inconsistent since

\begin{align}
\plim b = \beta + \plim (\frac{X'X}{n})^{-1} \plim (\frac{X'\epsilon}{n}) \neq \beta
\end{align}

Endogeneity can be induced by major factors such as omitted variable bias, measurement error, and simultaneity. In this study, we focus on the problem of measurement error in the data.

Measurement error refers to the problem where data, due to some reason, differs from the true value. Measurement error is divided into systematic error and random error. Systematic error refers to the situation where the measured value differs from the true value due to a specific pattern. For example, a scale might be incorrectly zeroed, giving a value that is always higher than the true value. Random error means that the measurement is affected by random factors that deviate from the true value.

While systematic errors can be corrected by data preprocessing that handles specific patterns in the data, random errors require the random factors themselves to be modeled. In theory, various assumptions can be made about the random factor, but it is common to assume that the errors follow a Normal distribution.

We now derive the regression coefficient under the classical measurement error model with normally distributed random errors. Consider the following linear regression:

\begin{align}
y = \beta x + \epsilon
\end{align}

We define the observed regressor $\tilde{x}$, which contains the measurement error $u$, as follows.

\begin{align}
\tilde{x} = x + u
\end{align}

Substitute (4) into (3):

\begin{align}
y = \beta (\tilde{x} - u) + \epsilon = \beta \tilde{x} + (\epsilon - \beta u)
\end{align}

Hence,

\begin{gather}
b = (X'X)^{-1}X'y \\
\plim b = (\frac{\sigma_{x}^{2}}{\sigma_{x}^{2} + \sigma_{u}^{2}})\beta
\end{gather}

When measurement error occurs as described above, the larger its magnitude, the more severe the regression dilution problem, in which the estimated coefficient shrinks toward zero. In the extreme case, if the variance of the measurement error dominates that of the explanatory variable, the model treats the observed variable as mostly noise and the regression coefficient will be close to zero. This problem occurs not only in simple linear regression but also in multiple linear regression.
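The attenuation in (7) can be checked numerically. The following is a minimal sketch in Python with NumPy, assuming a zero-mean regressor, a regression without intercept, and hypothetical values for $\beta$, $\sigma_{x}$, and $\sigma_{u}$ that are not taken from the study.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical parameters, chosen only for illustration
n = 200_000
beta = 2.0
sigma_x, sigma_u = 1.0, 0.5

x = rng.normal(0.0, sigma_x, n)      # true regressor
u = rng.normal(0.0, sigma_u, n)      # additive measurement error
eps = rng.normal(0.0, 1.0, n)        # regression error
y = beta * x + eps
x_tilde = x + u                      # observed regressor, as in (4)

# OLS slope without intercept: b = (x'x)^{-1} x'y
b_clean = (x @ y) / (x @ x)
b_noisy = (x_tilde @ y) / (x_tilde @ x_tilde)

# Probability limit predicted by the attenuation formula (7)
b_theory = beta * sigma_x**2 / (sigma_x**2 + sigma_u**2)

print(f"true x: {b_clean:.3f}, noisy x: {b_noisy:.3f}, theory: {b_theory:.3f}")
```

With these values the attenuation factor is $1 / 1.25 = 0.8$, so the slope estimated from the noisy regressor should settle near $1.6$ rather than $2$.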

In addition to the additive case, where the measurement error is added to the original variable, we can also consider a multiplicative case where the error is multiplied. In the multiplicative case, the regression dilution problem occurs as follows.

\begin{gather}
\tilde{x} = xw = x + u \\
u = x(w - 1)
\end{gather}

Similarly, substituting (9) into (3) yields a result similar to (7). Assuming that $w$ is independent of $x$ with $E(w) = 1$ and $Var(w) = \sigma_{w}^{2}$, the variance of the measurement error $u$ is derived as follows.

\begin{align}
\sigma_{u}^{2} = E[X(w - 1)X(w - 1)] = E(w^{2}X^{2} - 2wX^{2} + X^{2}) = \sigma_{w}^{2}(\sigma_{x}^{2} + \mu_{x}^{2})
\end{align}

Therefore, in the case of measurement error, the sign of the regression coefficient does not change, but the size of the regression coefficient gets attenuated, making it difficult to quantitatively measure the effect of a certain variable.
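The variance expression in (10) can be verified by simulation. This sketch assumes that $w$ is drawn independently of $x$ with mean 1, and uses hypothetical values for $\mu_{x}$, $\sigma_{x}$, and $\sigma_{w}$.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical parameters, chosen only for illustration
n = 500_000
mu_x, sigma_x, sigma_w = 3.0, 1.0, 0.2

x = rng.normal(mu_x, sigma_x, n)
w = rng.normal(1.0, sigma_w, n)   # assumption: E(w) = 1 and w independent of x
u = x * (w - 1.0)                 # implied additive error, as in (9)

var_u_empirical = np.mean(u**2)   # E(u) = 0, so E(u^2) is the variance
var_u_theory = sigma_w**2 * (sigma_x**2 + mu_x**2)

print(f"empirical: {var_u_empirical:.4f}, theory: {var_u_theory:.4f}")
```

Because the implied error variance grows with $\mu_{x}^{2}$, the same multiplicative noise produces a larger additive error for larger spend levels.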

However, let us look at the endogeneity problem from the perspective of prediction, where the only goal is to forecast the dependent variable accurately, rather than from the explanatory perspective, where we try to explain phenomena through data. In prediction, the size and sign of the coefficients are not crucial. Although the estimated regression coefficient is inconsistent in an explanatory context, there is research showing that, in terms of the residual errors that matter for prediction, endogeneity is not a significant issue\cite{Greenshtein}.

Given these results and recent advancements in computational science, countless non-linear models have been proposed, which could lead one to think that the endogeneity problem is insignificant from a predictive perspective. However, measurement error in the covariates shrinks the regression coefficient, so the model underfits relative to the actual data. We discuss the influence of this underfitting later.

Heteroskedasticity

Heteroskedasticity means that the variance of the residuals is not constant across observations in OLS (Ordinary Least Squares). If the residuals are heteroskedastic, the assumptions of the Gauss-Markov theorem are violated, so the OLS estimator is no longer efficient from an analytical point of view. From the predictive perspective, heteroskedastic residuals in nonlinear models are also known to lead to inaccurate predictions during extrapolation.

In digital advertising data, measurement error can induce heteroskedasticity, in addition to the endogeneity problem of measurement error itself. As mentioned in the introduction, the size of the measurement error decreases the further back in time the data is from the present, since the providers of advertising data are constantly updating the data. Therefore, the characteristic of varying measurement error sizes depending on the recency of data can potentially induce heteroskedasticity into the model.

Poisson Time Series

The Poisson Time Series model is based on Poisson Regression, which uses the log link as the link function within the GLM (Generalized Linear Model) class, with additional autoregressive and moving average terms. The key difference from vanilla Poisson Regression and from ARIMA-based models is that the time series parameters are set to reflect data that follow a conditional Poisson distribution.

Let the log link from the GLM be $\log(\mu) = X\beta$. The equation with the additional autocorrelation parameters is then as follows.

\begin{align}
\log(\lambda_{i}) = \beta_{0} + \sum_{j=1}^{p}\beta_{j}\log(Y_{i-j} + 1) + \sum_{l=1}^{q}\alpha_{l}\log(\lambda_{i-l}) + \eta'X
\end{align}

Where $\beta_{0}$ is the intercept, $\beta_{j}$ is the autoregressive parameter, $\alpha_{l}$ is the moving average parameter, and $\eta$ is the covariate parameter. The estimation is done as follows. Consider the log-likelihood

\begin{align}
l(\theta) = \sum_{i=1}^{n}\log p_{i}(y_{i} | \theta) = \sum_{i=1}^{n}(y_{i}\log(\lambda_{i}(\theta)) - \lambda_{i}(\theta))
\end{align}

and the Score function is derived as follows

\begin{align}
S(\theta) = \frac{\partial l(\theta)}{\partial \theta} = \sum_{i=1}^{n}(\frac{y_{i}}{\lambda_{i}(\theta)} - 1)\frac{\partial \lambda_{i}(\theta)}{\partial \theta}
\end{align}

By iteratively calculating the score function using the mean-variance relationship assumed in the GLM, the information matrix is derived as follows. For Poisson Regression, it is assumed that the mean and variance are the same.

\begin{align}
I(\theta) = \sum_{i=1}^{n} Var(\frac{\partial l(\theta)}{\partial \theta}) = \sum_{i=1}^{n}(\frac{1}{\lambda_{i}(\theta)})(\frac{\partial \lambda_{i}(\theta)}{\partial \theta})(\frac{\partial \lambda_{i}(\theta)}{\partial \theta})'
\end{align}

To estimate the parameters, we maximize the quasi log-likelihood by non-linear optimization with a Quasi-Newton algorithm. MLE requires assuming the full shape of the distribution, which makes it powerful but difficult to use in some cases. The quasi-likelihood approach instead assumes only the mean-variance relationship of a specific distribution. It is generally known that the Quasi-MLE also satisfies CUAN (Consistent and Uniformly Asymptotically Normal), given a well-defined mean-variance relationship, similar to MLE. However, it is a less efficient estimator than MLE when MLE computation is possible.

One advantage of a GLM-based Poisson Time Series model in this study is that GLM does not assume homoskedastic residuals and relies only on the mean-variance relationship. This allows us, to a certain extent, to bypass the heteroskedasticity in residuals that can arise when the size of the measurement error varies across observation periods.
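As an illustration of this estimation step, the sketch below fits a reduced version of (11) with a single autoregressive term ($p = 1$, $q = 0$, no covariates) by maximizing the Poisson log-likelihood (12) with SciPy's BFGS routine, a Quasi-Newton method. The simulated data, the parameter values, and the reduced specification are assumptions made for illustration and are not the model fitted in this study.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)

# Simulate log(lambda_i) = b0 + b1 * log(Y_{i-1} + 1) with hypothetical values
b0_true, b1_true, n = 0.5, 0.5, 5000
y = np.zeros(n, dtype=np.int64)
for i in range(1, n):
    lam = np.exp(b0_true + b1_true * np.log(y[i - 1] + 1.0))
    y[i] = rng.poisson(lam)

# Design matrix: intercept and log-transformed lagged count
Z = np.column_stack([np.ones(n - 1), np.log(y[:-1] + 1.0)])
target = y[1:]

def neg_loglik(theta):
    """Negative mean log-likelihood, dropping the constant log(y!) term."""
    eta = Z @ theta
    return -np.mean(target * eta - np.exp(eta))

def neg_score(theta):
    """Negative mean score; under the log link d(lambda)/d(theta) = lambda * z."""
    lam = np.exp(Z @ theta)
    return -(Z.T @ (target - lam)) / len(target)

start = np.array([np.log(target.mean()), 0.0])
fit = minimize(neg_loglik, x0=start, jac=neg_score, method="BFGS")
theta_hat = fit.x

print("estimated (b0, b1):", np.round(theta_hat, 3))
```

Supplying the analytic score keeps the optimizer from relying on finite differences. Adding moving average terms would require computing $\lambda_{i}$ recursively inside the objective.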

Poisson Kalman Filter

The Kalman Filter belongs to the class of state space models, which combine a state equation and an observation equation to describe the movement of data. When observations are accurate, the weight of the observation equation increases. When observations are inaccurate, the filter relies more on the values derived from the state equation. This feature allows data movements to be estimated even when the data is inaccurate, as in the case of measurement error, or when data is missing.

Let us consider the Linear Kalman Filter, a representative Kalman Filter model. Assuming a covariate $U$, the state equation representing the movement of the data is given by

\begin{align}
x_{t} = \Phi x_{t-1} + \Upsilon u_{t} + w_{t}
\end{align}

Where $w_{t}$ is an independent and identically distributed error that follows Normal distribution, assuming $E(W) = 0$ and $Var(W) = Q$.

The Kalman Filter uses observation equation to update its predictions, where the equation is

\begin{align}
y_{t} = A_{t}x_{t} + \Gamma u_{t} + v_{t}
\end{align}

Where $v_{t}$ is an independent and identically distributed error that, like $w_{t}$, follows a Normal distribution, with $E(V) = 0$ and $Var(V) = R$.

Let $x_{0} = \mu_{0}$ be the initial value and $P_{0} = \Sigma_{0}$ be the variance of $x$. Recursively iterate over the expression below

\begin{gather}
x_{t} = \Phi x_{t-1} + \Upsilon u_{t}\\
P_{t} = \Phi P_{t-1} \Phi ' + Q
\end{gather}

with

\begin{gather}
x := x_{t} + K_{t}(y_{t} - A_{t}x_{t} - \Gamma u_{t})\\
P := [I -K_{t}A_{t}]P_{t}
\end{gather}

where

\begin{align}
K_{t} = P_{t}A_{t}'[A_{t}P_{t}A_{t}' + R]^{-1}
\end{align}

The process of updating the data in (19) and (20) draws on ideas from Bayesian methodology, where the state equation can be considered a prior that we know in advance and the observation equation a likelihood. The Linear Kalman Filter is known to have the minimum MSE (Mean Squared Error) among linear estimators if the model is well specified (the process and measurement covariances are known), even if the residuals are not Gaussian.
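The recursion in (17) to (21) can be sketched for the scalar case as follows. This is a minimal illustration that assumes no covariates ($\Upsilon = \Gamma = 0$), a constant $A_{t}$, and a simulated random-walk state. The function name `kalman_filter` and all parameter values are hypothetical.

```python
import numpy as np

def kalman_filter(y, phi, q, a, r, mu0, p0):
    """Scalar linear Kalman filter without covariates; NaN marks a missing observation."""
    x_filt = np.empty(len(y))
    gains = np.empty(len(y))
    x, p = mu0, p0
    for t, obs in enumerate(y):
        # Prediction step from the state equation, as in (17) and (18)
        x_pred = phi * x
        p_pred = phi * p * phi + q
        if np.isnan(obs):
            # Nothing to update with: carry the prediction forward
            x, p, k = x_pred, p_pred, 0.0
        else:
            k = p_pred * a / (a * p_pred * a + r)   # Kalman Gain, as in (21)
            x = x_pred + k * (obs - a * x_pred)     # state update, as in (19)
            p = (1.0 - k * a) * p_pred              # variance update, as in (20)
        x_filt[t], gains[t] = x, k
    return x_filt, gains

rng = np.random.default_rng(3)
n, q, r = 500, 0.1, 1.0
state = np.cumsum(rng.normal(0.0, np.sqrt(q), n))   # random-walk state (Phi = 1)
obs = state + rng.normal(0.0, np.sqrt(r), n)        # noisy observations (A = 1)
obs[100:105] = np.nan                               # a short gap of missing data

x_filt, gains = kalman_filter(obs, phi=1.0, q=q, a=1.0, r=r, mu0=0.0, p0=1.0)

seen = ~np.isnan(obs)
mse_raw = np.mean((obs[seen] - state[seen]) ** 2)
mse_filt = np.mean((x_filt[seen] - state[seen]) ** 2)
print(f"MSE of raw observations: {mse_raw:.3f}, MSE of filtered state: {mse_filt:.3f}")
```

In this setup the filtered state should track the true state more closely than the raw observations do, and the filter continues through the gap by relying on the state equation alone.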

The Poisson Kalman Filter is a type of extended Kalman Filter. The state equation can be designed in a variety of ways, but in this study, the state equation is set to be Gaussian, just like the Linear Kalman Filter. Instead, similar to the idea in GLM, we introduce a log-link in the observation equation, which can be expressed as

\begin{gather}
E(y_{i} | \theta_{i}) = Var(y_{i} | \theta_{i}) = \exp(\theta_{i}) \\
\theta_{i} = \log(\lambda_{i})
\end{gather}

We define $K_{t}$, which is derived in (21), as the Kalman Gain. It determines the weight given to the observation equation in the update (19) and lies between 0 and 1. From the expression in (21), we can see that $K_{t}$ has the same structure as the shrinkage of $\beta$ in (7). Whereas in (7) the magnitude of $\sigma_{u}^{2}$ determines the degree of attenuation, in (21) the weight is determined by $R$, the covariance matrix of $v_{t}$ in the observation equation. Consequently, even if there is measurement error in the data, the weight of the state equation increases with the magnitude of the measurement error, indicating that the Kalman Filter inherently mitigates the measurement error problem.
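For intuition, consider the scalar case with $A_{t} = 1$, an assumption made here only for illustration. The Kalman Gain in (21) then reduces to

\begin{align*}
K_{t} = \frac{P_{t}}{P_{t} + R}, \qquad \lim_{R \to 0} K_{t} = 1, \qquad \lim_{R \to \infty} K_{t} = 0
\end{align*}

which has the same form as the attenuation factor $\sigma_{x}^{2} / (\sigma_{x}^{2} + \sigma_{u}^{2})$ in (7), with $P_{t}$ playing the role of $\sigma_{x}^{2}$ and $R$ the role of $\sigma_{u}^{2}$. For example, $P_{t} = 1$ and $R = 3$ give $K_{t} = 0.25$, so the update moves only a quarter of the way from the prediction toward the observation.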

Ensemble Methods

Ensemble Methods combine multiple heterogeneous models to build a large model that is better than the individual models. There are various ways to combine models, such as bagging, boosting, and stacking. In this study, we used the stacking method that combines models appropriately using weights.

Stacking is a method that applies a weighted average to the predictions derived from heterogeneous models to finally predict data. It can be understood as solving an optimization problem that minimizes an objective function under some constraints, and the objective function can be flexibly designed according to the purpose of the model and the Data Generating Process(DGP).
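To make the optimization view concrete, the sketch below stacks two sets of predictions by minimizing a precision-weighted squared error subject to the weights being non-negative and summing to one. The objective, the function names, and the synthetic data are assumptions for illustration and may differ from the objective function designed in this study.

```python
import numpy as np

def weighted_sse(y, f, precision):
    """Precision-weighted sum of squared errors of prediction f."""
    return float(np.sum(precision * (y - f) ** 2))

def stack_two_models(y, f1, f2, precision):
    """Weight w in [0, 1] minimizing weighted_sse(y, w * f1 + (1 - w) * f2, precision)."""
    d = f1 - f2
    denom = np.sum(precision * d * d)
    if denom == 0.0:
        return 0.5                      # identical predictions: any weight is optimal
    w = np.sum(precision * (y - f2) * d) / denom
    # The objective is quadratic in w, so clipping yields the constrained optimum
    return float(np.clip(w, 0.0, 1.0))

rng = np.random.default_rng(4)
n = 60
y = rng.poisson(20.0, n).astype(float)
f1 = y + rng.normal(0.0, 1.0, n)        # hypothetical predictions of a more accurate model
f2 = y + rng.normal(0.0, 4.0, n)        # hypothetical predictions of a less accurate model

precision = np.ones(n)
precision[-7:] = 0.2                    # assumption: down-weight the 7 most recent, noisier points

w = stack_two_models(y, f1, f2, precision)
stacked = w * f1 + (1.0 - w) * f2
print(f"weight on model 1: {w:.3f}")
```

With two models the constrained problem has a closed form. With more models, a constrained numerical optimizer would be needed instead.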

3. Data Description

3.1 Introduction

The raw data used in the study are the results of digital advertising run over a specific period in 2022. The independent variable is the marketing spend, and the dependent variable is the marketing conversion. Since marketing conversions are count data (1, 2, and so on) with a low probability of occurrence, a Poisson-based model is an appropriate choice.

3.2 Data Preprocessing and Assumptions

From the overall marketing performance in the raw data, we kept only the performance generated by marketing channels that use marketing spend. Generally, marketing performance obtained using marketing spend is referred to as "Paid Performance", while performance gained without using marketing spend is classified as "Organic Performance". There may be a correlation between organic and paid performance depending on factors such as the size of the service, brand recognition, and some exogenous factors. Moreover, each marketing channel has a different influence, and channels can affect each other, which suggests applying a hierarchical or multivariate model. However, in this study, a univariate model was applied.

To verify the impact of measurement error, observed values were created by multiplying the actual marketing spend (true value) by the size of the measurement error. The error is multiplicative because the size of the measurement error is proportional to the marketing spend. Considering that more recent observations are less accurate, the measurement error was set to increase exponentially as the data approach the most recent value. As mentioned in the introduction, media executing ads usually keep updating data for up to a week, so measurement errors were applied only to the most recent 7 data points. The observed values were generated as follows.

\begin{gather}
e_{i} = \max(0, 1 + \eta_{i})\\
\eta_{i} \sim N(0, a(1+r)^{-\min(0, n-(i+7))})\\
spend^{*}_{i} = e_{i} * spend_{i}
\end{gather}

Where $e_{i}$ is the parameter representing the measurement error at time $i$. Since the ad spend cannot be negative, we set the lower bound of $e_{i}$ to zero. The error is randomly determined by two parameters, $a$ and $r$, where $a$ is the scaling parameter and $r$ is the size of the error. We also accounted for the fact that the measurement error decreases exponentially as the data become older.

As mentioned earlier, this measurement error is multiplicative, which can cause the variance of the residuals to increase non-linearly. The magnitude of the measurement error is set within $[0.5, 1]$, which stays inside the valid domain, and the simulation uses the Monte Carlo method ($n = 1,000$).
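The generating process for the observed spend can be sketched as follows. This is an illustration under stated assumptions: the second argument of $N(\cdot, \cdot)$ is read as a variance, the noise is switched off for points older than the 7-day window as described in the text, and the function name `simulate_observed_spend` and the values of $a$ and $r$ are hypothetical.

```python
import numpy as np

def simulate_observed_spend(spend, a, r, rng, window=7):
    """Multiply true spend by e_i = max(0, 1 + eta_i) for the most recent `window` points."""
    n = len(spend)
    i = np.arange(1, n + 1)                    # time index 1..n
    k = -np.minimum(0, n - (i + window))       # 0 for older points, 1..window for recent ones
    var = a * (1.0 + r) ** k                   # variance grows toward the most recent point
    var = np.where(k > 0, var, 0.0)            # assumption: no error outside the window
    eta = rng.normal(0.0, np.sqrt(var))
    e = np.maximum(0.0, 1.0 + eta)             # spend cannot be negative
    return e * spend

rng = np.random.default_rng(5)
true_spend = np.full(30, 100.0)                # hypothetical constant daily spend
observed = simulate_observed_spend(true_spend, a=0.01, r=0.5, rng=rng)
print(np.round(observed[-10:], 1))
```

Repeating the call with fresh random draws gives one Monte Carlo replication per call. Only the last 7 entries differ from the true spend.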

4. Data Modeling

Based on the aforementioned data, we define the independent and dependent variables for modeling. The dependent variable $count_{i}$ is the marketing conversion at time $i$, and the independent variable is the marketing spend over the window $[i-7, i]$. The dependent variable is assumed to follow the conditional Poisson distribution below.

\begin{align}
count_{i} | spend_{i} \sim pois(\lambda)
\end{align}

The lagged variables over the previous 7 days reflect the lag effect of users who were influenced by an ad in the past, so that a marketing conversion occurs after some time rather than on the same day. The optimal window may vary depending on the type of marketing action and industry, but we used a 7-day window as a general-purpose choice.

First, let us apply a Distributed Lag Poisson Regression with true values that do not reflect measurement error and do not reflect autocorrelation effects. The equation and results are as follows.

\begin{align}
\log(\lambda_{t}) = \beta_{0} + \sum_{i=1}^{8}\beta_{i}Spend_{(t-i+1)}
\end{align}

Table 1: Summary of Distributed Lag Poisson Regression

The results show that including the spend lags up to 7 periods is significant for model fit. To test the autocorrelation of the residuals, we derived the ACF (Autocorrelation Function) and PACF (Partial Autocorrelation Function). We used Pearson residuals to account for the fit of the Poisson Regression model.

Figure 3: ACF Plot of Distributed Lag Poisson Regression

Figure 4: PACF Plot of Distributed Lag Poisson Regression

The plots show autocorrelation in the residuals, so we need to add time series parameters to the model. The model equation with autoregressive and moving average parameters under a Poisson distribution is as follows.

\begin{align}
\log(\lambda_{t}) = \beta_{0} + \sum_{k=1}^{7}\beta_{k}\log(Y_{t-k} + 1) + \alpha_{7}\log(\lambda_{t-7}) + \sum_{i=1}^{8}\eta_{i}Spend_{(t-i+1)}
\end{align}

Where $\eta_{i}$ are the coefficients on marketing spend, which is used as the independent variable, $\beta_{0}$ is the intercept, $\beta_{k}$ are the coefficients on the log-transformed lagged values of the dependent variable, and $\alpha_{7}$ is the coefficient on the log of the unobserved conditional mean at lag 7, which reflects seasonality. The $\beta_{k}$ terms allow us to include effects on the model other than the marketing spend used as a covariate, and the $\alpha_{7}$ term is inserted to account for the day-of-week effect, since the data are daily.

The results show that the lagged terms associated with $\alpha$ and $\beta$ are significant within the 7-period lag window. The quasi log-likelihood is -874.725, a significant increase from before, and the AICc and BIC, which penalize model complexity, are also better for the Poisson Time Series.

Table 2: Summary of Poisson Time Series Model

As shown below, when deriving ACF and PACF with Pearson residuals, we can see that autocorrelation is largely eliminated. Therefore, the results so far show that Poisson Time Series is better than Distributed Lag Poisson Regression.

Figure 5: ACF Plot of Poisson Time Series

Figure 6: PACF Plot of Poisson Time Series

Next, we simulate measurement error in our independent variable, marketing spend, and examine how it affects the proposed models.

5. Results

In this study, we evaluated the models on a number of criteria to understand the impact of measurement error and to determine which of the proposed models is superior. First, "Prediction Accuracy" indicates how well a model can actually predict future values, regardless of its in-sample fit. The forecast horizon was set to one step ahead, and accuracy was measured by the Mean Absolute Error (MAE).

Since the data have a time series structure, it is difficult to perform K-fold cross-validation or LOOCV (Leave-One-Out Cross-Validation), which divide the data arbitrarily. Therefore, the MAE was derived by fitting the model with the initial $d$ data points, predicting one step ahead, and then rolling the model forward to repeat the same operation with one more data point. The MAE for the Poisson Time Series is as follows.
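The rolling evaluation described above can be sketched as follows. The function `rolling_one_step_mae` and the two naive forecasters are hypothetical stand-ins. In practice the `fit_predict` callable would refit the model under study on each expanding window.

```python
import numpy as np

def rolling_one_step_mae(y, d, fit_predict):
    """Fit on the first `end` points, forecast point `end`, and expand the window by one."""
    errors = []
    for end in range(d, len(y)):
        pred = fit_predict(y[:end])          # only past data is visible to the model
        errors.append(abs(y[end] - pred))
    return float(np.mean(errors))

def last_value(train):
    """Naive forecaster: repeat the most recent observation."""
    return train[-1]

def running_mean(train):
    """Naive forecaster: mean of all observations so far."""
    return train.mean()

y = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
mae_last = rolling_one_step_mae(y, d=3, fit_predict=last_value)
mae_mean = rolling_one_step_mae(y, d=3, fit_predict=running_mean)
print(mae_last, mae_mean)   # prints 1.0 2.5
```

Because each forecast uses only observations that precede it, the temporal order of the data is preserved, unlike in K-fold cross-validation.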

Table 3: Mean Absolute Error (# of simulations = 1,000)

We can see that as the magnitude of the measurement error increases, the prediction accuracy decreases. However, at low levels of measurement error, we actually see lower MAE on average compared to performance evaluation on real data. This implies that instead of inserting bias into the model, the measurement error reduced the variance, which is more beneficial from an MAE perspective. The expression for MSE as a function of bias and variance is as follows.

\begin{align}
MSE = Bias^{2} + Var
\end{align}

If $Var$ decreases by more than $Bias^{2}$ increases, the model has moved away from overfitting. The same reasoning applies to MAE, which is simply a different metric. Therefore, with a reasonable measurement error size, the attenuation of the regression coefficient on the independent variable due to the measurement error can be understood as a kind of regularization effect.

However, for measurement errors above a certain size, the MAE is higher on average than with the actual data. Therefore, if the measurement error is large, it is necessary either to keep refreshing the model with the revised data that providers release continuously, or to reduce the size of the measurement error using the idea of repeated measures ANOVA (Analysis of Variance).

In some cases, you may decide that it is better to force additional regularization from the MAE perspective. In this case, it would be natural to use something like Ridge Regression, since the measurement error has been acting to dampen the coefficient effect in the same way as Ridge Regression.

The influence of measurement error decreases as the number of data points increases. This is because measurement error is present only in the last 7 data points regardless of the sample size, so its share of the total data gradually shrinks. Therefore, the impact of measurement error is not significant in modeling situations where we have more than a certain number of data points.

However, in the case of digital advertising, there may be issues such as terminating ads within a short period of time if marketing performance is poor. Therefore, if you need to perform a hypothesis test with short-term data, you need to adjust the significance level to account for the effect of measurement error.

The 2SLS(2 Stage Least Squares) model, inserted in the table, will be proposed later to check the efficiency of the coefficients. Note that the 2SLS has a high MAE due to initial uncertainty, but as the data size increases, the MAE decreases rapidly compared to the original model.

Next, we need to determine the nature of the residuals in order to make more accurate and robust predictions. Therefore, we performed autocorrelation and heteroskedasticity tests on the residuals.

The following results are from the autocorrelation test on the Pearson residuals. In this study, the Breusch-Godfrey test used for regression models was performed at lag 7. The Ljung-Box test is more commonly used, but it belongs to the Wald test class, which has high power only under the strong exogeneity (mean independence) assumption between the residuals and the independent variables\cite{Hayashi}. That assumption is not appropriate for this study, which involves measurement error and few data points. The Breusch-Godfrey test, on the other hand, is more robust than the Ljung-Box test because, as a Score test, it relies on a more relaxed exogeneity (same-row uncorrelatedness) assumption.

Table 4: p-value of Breusch-Godfrey Test for lag 7 (# of simulations = 1,000)

The test shows that the measurement error does not significantly affect the autocorrelation of the residuals.

Next, here are the results for the heteroskedasticity test. Although GLM-type models do not specifically assume homoskedasticity of the residuals, we still need to investigate the mean-variance relationship assumed in the modeling. To check this indirectly, we scaled the residuals as Pearson, and then performed a Breusch-Pagan test for heteroskedasticity.
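The idea of testing scaled residuals can be sketched as follows. The example computes Pearson residuals and the studentized (Koenker) form of the Breusch-Pagan statistic, $n R^{2}$ from regressing squared residuals on a single regressor. The function names, the simulated data, and the use of the true mean in place of fitted values are assumptions for illustration.

```python
import math
import numpy as np

def pearson_residuals(y, lam):
    """Residuals scaled by the Poisson standard deviation sqrt(lambda)."""
    return (y - lam) / np.sqrt(lam)

def breusch_pagan_1df(resid, x):
    """Koenker form of the Breusch-Pagan test with one regressor (chi-square, 1 df)."""
    n = len(resid)
    X = np.column_stack([np.ones(n), x])
    sq = resid ** 2
    coef, *_ = np.linalg.lstsq(X, sq, rcond=None)
    fitted = X @ coef
    r2 = 1.0 - np.sum((sq - fitted) ** 2) / np.sum((sq - sq.mean()) ** 2)
    lm = max(n * r2, 0.0)
    p_value = math.erfc(math.sqrt(lm / 2.0))   # survival function of chi-square with 1 df
    return lm, p_value

rng = np.random.default_rng(7)
n = 2000
x = rng.normal(0.0, 1.0, n)
lam = np.exp(1.0 + 0.3 * x)                    # hypothetical conditional mean

# Case 1: counts that satisfy the Poisson mean-variance relationship
y = rng.poisson(lam)
lm_ok, p_ok = breusch_pagan_1df(pearson_residuals(y, lam), x)

# Case 2: synthetic scaled residuals whose variance exp(x) depends on the regressor
resid_bad = rng.normal(0.0, np.exp(x / 2.0))
lm_bad, p_bad = breusch_pagan_1df(resid_bad, x)

print(f"Poisson case: LM = {lm_ok:.2f}, p = {p_ok:.3f}")
print(f"variance depends on x: LM = {lm_bad:.2f}, p = {p_bad:.2e}")
```

A small p-value in the second case signals that the assumed mean-variance relationship does not hold, which is the situation the test is meant to detect.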

Table 5: p-value of Breusch-Pagan Test (# of simulations = 1,000)

We can see that the measurement error does not significantly affect the assumed mean-variance relationship of the model. Consider the process of estimating the parameters in a GLM. The Information Matrix in (14) is weighted by the inverse of the variance, and since the mean equals the variance in Poisson Regression, it is weighted by the inverse of the mean. Because it uses a weight matrix in a way similar to GLS (Generalized Least Squares), it has the inherent effect of suppressing heteroskedasticity to a certain extent by giving lower weights to uncertain data.

On the other hand, we can see that the Breusch-Pagan test has a low p-value at some data sizes. Where the p-value falls below the 0.05 significance level, the null hypothesis of homoskedasticity is rejected. This is because there is a regime shift in the independent variable before and after $n = 47$, as shown in Fig. 1.
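The weighting can be made explicit with a Fisher scoring iteration built directly from the score (13) and the information matrix (14). This is a minimal sketch for a plain Poisson Regression with a log link. The function name `poisson_fisher_scoring` and the simulated data are hypothetical, and the time series terms of the full model are omitted.

```python
import numpy as np

def poisson_fisher_scoring(X, y, max_iter=50, tol=1e-10):
    """Poisson regression with log link; the first column of X is assumed to be the intercept."""
    theta = np.zeros(X.shape[1])
    theta[0] = np.log(y.mean())                  # start from the intercept-only fit
    for _ in range(max_iter):
        lam = np.exp(X @ theta)
        dlam = lam[:, None] * X                  # d(lambda_i)/d(theta) under the log link
        score = dlam.T @ (y / lam - 1.0)         # score function, as in (13)
        info = (dlam / lam[:, None]).T @ dlam    # information matrix with weights 1/lambda_i, as in (14)
        step = np.linalg.solve(info, score)
        theta = theta + step
        if np.max(np.abs(step)) < tol:
            break
    return theta

rng = np.random.default_rng(8)
n = 5000
X = np.column_stack([np.ones(n), rng.normal(0.0, 1.0, n)])
theta_true = np.array([1.0, 0.3])                # hypothetical coefficients
y = rng.poisson(np.exp(X @ theta_true))

theta_hat = poisson_fisher_scoring(X, y)
print("estimated coefficients:", np.round(theta_hat, 3))
```

Each observation enters the information matrix divided by its own variance $\lambda_{i}$, which is the sense in which the estimation resembles GLS.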

To test this, we performed a Quasi Likelihood Ratio Test (df = 9) between the saturated model, which accounts for the pattern change before and after the regime shift, and the reduced model, which does not. The results are shown below.

Table 6: Quasi-LRT for Structural Break (Changepoint = 47)

Since the test statistic exceeds the rejection bound, the result is significant at the 0.05 significance level. It can be concluded that the interruption of ad delivery after the changepoint, or the lower marketing spend compared to before, may have affected the assumed mean-variance relationship. We do not consider this in our study, but it would be possible to account for regime shifts retrospectively or to use a Negative Binomial based regression model.

Next, we test the efficiency of the estimates. Although this study does not focus on the endogeneity of the coefficients, we use a 2SLS model as the specification for the efficiency test. The proposed instrumental variable is ad impressions. An instrumental variable should have two characteristics. First, it should be "Relevant", meaning that the correlation between the instrumental variable and the original variable is high. The variance of the regression coefficient estimated with the instrumental variable is higher than that of the model estimated with the original variable, and the higher the correlation, the smaller this gap becomes (Highly Relevant). Since the ad publisher's billing policy is "Cost per Impression", the correlation between ad spend and impressions is high.

Second, "Validity" is the most important property of an instrumental variable: it should be uncorrelated with the errors in order to eliminate endogeneity. In the digital advertising market, when a user is exposed to a display ad, the price of the ad is determined by two things: the number of "Impressions" and the "Strength of Competition" between real-time ad auction bidders. Since the effect of impressions has been removed from the residuals, it is unlikely that the remaining factor, the strength of competition among auction bidders, is correlated with the user being forced to see the ad. Furthermore, the orthogonality test below fails to reject the null hypothesis of no correlation.

Table 7: p-value of Test for Orthogonality

Therefore, we can see that it makes sense to use "Impressions" as an instrumental variable instead of marketing spend. Here are the proposed 2SLS equations.

\begin{gather}
\hat{Spend}_{t} = \gamma_{0} + \gamma_{1}Imp_{t}\\
\log(\lambda_{t}) = \beta_{0} + \sum_{k=1}^{7}\beta_{k}\log(Y_{t-k} + 1) + \alpha_{7}\log(\lambda_{t-7}) + \sum_{i=1}^{8}\eta_{i}\hat{Spend}_{t-i+1}
\end{gather}

The instrumental variable, the number of impressions, may itself contain measurement error, but random measurement error in the instrumental variable is known not to affect the validity of the model.
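The two-stage structure above can be sketched in a few lines. This is a minimal illustration on synthetic data, not the estimation code used in the study: the data-generating values, the variable names, and the helper `lagged_design` are assumptions made for the example. It fits the first stage by least squares and builds the eight lagged fitted-spend regressors that enter the second stage.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical synthetic data: impressions drive spend (cost-per-impression billing).
n = 200
impressions = rng.uniform(1_000, 5_000, size=n)
spend = 2.0 + 0.01 * impressions + rng.normal(0.0, 1.0, size=n)

# Stage 1: regress spend on the instrument (impressions) by least squares.
Z = np.column_stack([np.ones(n), impressions])
gamma, *_ = np.linalg.lstsq(Z, spend, rcond=None)
spend_hat = Z @ gamma  # fitted spend, used in place of raw spend in stage 2


def lagged_design(x, n_lags):
    """Columns x_t, x_{t-1}, ..., x_{t-n_lags+1}; rows start where all lags exist."""
    return np.column_stack(
        [x[n_lags - 1 - i : len(x) - i] for i in range(n_lags)]
    )


# Stage 2 regressors: the eight terms Spend_hat_{t-i+1}, i = 1..8.
X_spend = lagged_design(spend_hat, 8)
print(gamma, X_spend.shape)
```

The lagged columns would then be combined with the autoregressive terms of the Poisson Time Series in whatever count-regression routine is used for the second stage.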

We performed the Levene test to check the equality of residual variances and the Durbin-Wu-Hausman test to compare the estimated coefficients. Below is the result of the Levene test.

Table 8: p-value of Levene Test (m = 0) (# of simulations = 1,000)

We can see that the measurement error does not significantly affect the variance of the residuals. Furthermore, 2SLS also shows no significant difference in the variance of the residuals at the 0.05 significance level. This means that the instrumental variable is highly correlated with the original variable.

The Durbin-Wu-Hausman test checks whether there is a difference in the estimated coefficients between the proposed model and the original model. If the null hypothesis is rejected, the measurement error has a significant effect and the variance of the residuals will be affected. The results of the test between the original model and the model with measurement error are shown in the table below. We can see that the presence of measurement error does not affect the efficiency of the model, except in a few cases.

Table 9: p-value of Durbin-Wu-Hausman Test (m = 0) (# of simulations = 1,000)

In addition, we check whether there is a difference in the coefficients between the proposed 2SLS and the original model. If the null hypothesis is rejected, it can be understood that there is an effect of omitted variables other than measurement error, which can affect the variance of the residuals. The results of the test are shown below.

Table 10: p-value of Durbin-Wu-Hausman Test (2SLS)

When the data size is small, the model is not well specified and the 2SLS is more robust than the original model, but above a certain data size, there is no significant difference between the two models. In conclusion, the results of the above tests show that the proposed Poisson Time Series does not show significant effects of measurement error and unobserved variables. This is because, as mentioned earlier, the AR and MA parameters and the weight-matrix-based parameter estimation of GLM-class models inherently suppress some of these effects.

In addition to the GLM based Poisson Time Series, we also proposed a State Space Model based Poisson Kalman Filter. In the Poisson Kalman Filter, the inaccuracy of the observation equation due to measurement error is inherently corrected by the state equation, which makes the model robust to the measurement error problem.

The table below shows the benchmark results for the Poisson Time Series and the Poisson Kalman Filter. The log-likelihood is always higher for the Poisson Time Series, but the MAE is lower for the Poisson Kalman Filter. This can be understood as the Poisson Time Series being more complex and overfitted compared to the Poisson Kalman Filter.

However, after $n = 40$, the Poisson Time Series shows a rapid improvement in prediction accuracy. On the other hand, the Poisson Kalman Filter shows no significant improvement in prediction accuracy after a certain data point. This suggests that the model specification of the Poisson Time Series is appropriate beyond a certain data point.

We also compared the computational speed of the two models. We used the "furrr" library in the R 4.3.1 environment and ran each model 1,000 times to derive the simulated values. In terms of computation time, the Poisson Time Series is about 1 second slower on average, but we do not believe this has a significant business impact unless a very large simulation is required.

The table below shows the test results for the residuals of the Poisson Time Series and the Poisson Kalman Filter. We can see the heterogeneity between the two models. For the Poisson Kalman Filter, the p-values of the autocorrelation and homoscedasticity tests are initially high, but they decrease above a certain data size. This means that the Poisson Kalman Filter is not properly specified when the data size increases.

Finally, the PIT (Probability Integral Transform) allows us to empirically verify whether the model properly captures the mean-variance relationship. If the modeling was done properly, the histogram after the PIT should be close to a Uniform distribution. The farther it is from the Uniform distribution, the less the model reflects the DGP of the original data. In the graph below, we can see that the Poisson Time Series shows values that do not deviate much from the Uniform distribution, but the Poisson Kalman Filter results in values that are far from it.
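As an illustration of the PIT check described above, the following sketch computes a randomized PIT for Poisson counts and bins it into a histogram. It is an assumption-laden example rather than the procedure used in the study: the randomized variant is chosen here for simplicity (a non-randomized PIT for count data also exists), and the predicted means `lam` and the counts `y` are simulated.

```python
import math
import numpy as np


def poisson_cdf(k, lam):
    """P(Y <= k) for Y ~ Poisson(lam); returns 0.0 for k < 0."""
    if k < 0:
        return 0.0
    term = math.exp(-lam)  # P(Y = 0)
    total = term
    for j in range(1, k + 1):
        term *= lam / j  # P(Y = j) obtained from P(Y = j - 1)
        total += term
    return min(total, 1.0)


def randomized_pit(y, lam, rng):
    """Randomized PIT: u = F(y-1) + v * (F(y) - F(y-1)), v ~ Uniform(0, 1)."""
    u = np.empty(len(y))
    for t, (y_t, lam_t) in enumerate(zip(y, lam)):
        lower = poisson_cdf(int(y_t) - 1, lam_t)
        upper = poisson_cdf(int(y_t), lam_t)
        u[t] = lower + rng.uniform() * (upper - lower)
    return u


rng = np.random.default_rng(1)
lam = rng.uniform(2.0, 15.0, size=2000)  # hypothetical predicted means
y = rng.poisson(lam)                     # counts drawn from the same model
u = randomized_pit(y, lam, rng)
hist, _ = np.histogram(u, bins=10, range=(0.0, 1.0))
print(hist / len(u))  # bin shares; close to 0.1 each when the model fits
```

Because the counts here are drawn from the same Poisson means that are used for the transform, the histogram is close to flat; a misspecified model would show a U-shaped or hump-shaped histogram instead.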

6. Ensemble Methods

So far, we have covered Poisson Time Series and the Poisson Kalman Filter. When the data size is small, the Poisson Kalman Filter is reasonable, but above a certain data size, the Poisson Time Series is reasonable. To reflect the heterogeneity of these two models, we want to derive the final model through model averaging. The optimization objective function is shown below.

\begin{gather}
p_{t+1} = argmin_{p}\sum_{i=1}^{t}w_{i}|y_{i} - (p \hat{y}_{i}^{(GLM)} + (1 - p) \hat{y}_{i}^{(KF)})|\\
s.t. \hspace{0.1cm} 0 \leq p \leq 1, \hspace{1cm} \forall w > 0
\end{gather}

The objective function minimizes the MAE, and different data points are weighted differently via the $w_{i}$ parameter. $w_{i}$ is the precision (the reciprocal of the variance) at that point in time, taken as a share of the total precision. This reflects the fact that the more recent the data, the better the estimation and therefore the lower the variance, and that the better the model, the lower the variance. The final weighted model prediction process is shown below.

\begin{align}
\hat{y}_{t+1} = p_{t+1}\hat{y}_{t+1}^{(GLM)} + (1 - p_{t+1})\hat{y}_{t+1}^{(KF)}
\end{align}
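The optimization and the weighted prediction above can be sketched as follows. This is a minimal example under stated assumptions, not the study's implementation: the toy forecasts, the per-point variances, and the function name `fit_stacking_weight` are hypothetical, and a simple grid search over $p$ stands in for whatever solver was actually used (the objective is convex in $p$, so a fine grid is adequate for illustration).

```python
import numpy as np


def fit_stacking_weight(y, yhat_glm, yhat_kf, w, grid_size=1001):
    """Grid-search p in [0, 1] minimizing the weighted absolute error."""
    grid = np.linspace(0.0, 1.0, grid_size)
    # Blend for every candidate p at once: shape (grid_size, t).
    blend = grid[:, None] * yhat_glm[None, :] + (1.0 - grid[:, None]) * yhat_kf[None, :]
    loss = (w[None, :] * np.abs(y[None, :] - blend)).sum(axis=1)
    return grid[np.argmin(loss)]


# Hypothetical toy data, not taken from the study.
y = np.array([10.0, 12.0, 9.0, 15.0, 11.0])
yhat_glm = np.array([11.0, 12.5, 8.0, 14.0, 11.5])
yhat_kf = np.array([9.0, 10.0, 10.0, 12.0, 10.0])
var = np.array([4.0, 3.0, 2.0, 1.5, 1.0])  # assumed per-point variances
w = (1.0 / var) / (1.0 / var).sum()        # precision shares, all positive

p = fit_stacking_weight(y, yhat_glm, yhat_kf, w)
y_next = p * 12.0 + (1.0 - p) * 10.5  # blend of two assumed next-step forecasts
print(p, y_next)
```

Refitting $p$ at each new time point, with the weights recomputed from the latest variances, gives the sequence of weights $p_{t+1}$.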

The graph below shows the weights of the Poisson Time Series per data point derived from the stacking method. The weights are close to zero until $n = 42$, after which they increase significantly. In the middle, where the data becomes more volatile, such as at the regime shift (blue vertical line), the weights partially decrease.

The table below compares the final stacking model with the Poisson Time Series and the Poisson Kalman Filter. First, the stacking model has the best MAE at all data sizes, as it absorbs the advantages of both models: the Poisson Kalman Filter's advantage when the data size is small, and the Poisson Time Series' advantage above a certain data size. The robustness test also shows that the p-value of the stacking model lies between the p-values of the two individual models.

7. Conclusion

We have shown the impact of measurement error on count data in the digital advertising domain. Even if the main purpose is not to build an analytical model but simply to build a model that makes better predictions, it is also important to check the measurement error in predictive modeling since the model may be underfitted by the measurement error, and the residuals may be heteroskedastic depending on the characteristic of the measurement error.

To this end, we introduced GLM based Poisson Time Series, and Poisson Kalman Filter, a class of Extended Kalman Filter, which can partially solve the measurement error problem. After applying these models to simulated data based on real data, the results of prediction accuracy and statistical tests were obtained.

In terms of prediction accuracy, we found that the magnitude of the coefficients is attenuated due to measurement error, causing a kind of regularization effect. For the data used in this study, we found that the smaller the measurement error, the better the prediction accuracy, while the larger the measurement error, the worse the prediction accuracy compared to the original data. We also found that the impact of the measurement error was relatively high when the data size was small, but as the data size increased, the impact of the measurement error became smaller. This is due to the nature of digital advertising data, where only recent data is subject to measurement error.

The test of residuals shows that there is no significant difference with and without measurement error. Therefore, the proposed models can partially avoid the problem of measurement error, which is advantageous in digital advertising data.

We also note that the two models are heterogeneous in terms of data size. When the data size is small and the impact of measurement error is relatively large, we found that the Poisson Kalman Filter, which additionally utilizes the state equation, is superior to the overspecified Poisson Time Series. On the other hand, as the data size increases, we found that the Poisson Time Series is gradually superior in terms of model specification accuracy. Finally, based on the heterogeneity of the two models, we proposed an ensemble class of stacking models that can combine their advantages. In the tests of prediction accuracy and residuals, the advantages of the two models were combined, and the final model showed better results than the single model.

On the other hand, while we assumed that the data follows a conditional Poisson distribution, some data points may be overdispersed due to volatility. This is evidenced by the presence of structural breaks in the retrospective analysis. If the data has overdispersion compared to the model, it may be more beneficial to assume a Negative Binomial distribution. Also, since the proposed data is a daily time series data, further research on increasing the frequency to hourly data could be considered. Finally, although we assumed a univariate model in this study, in the case of real-world digital advertising data, a user may be influenced by multiple advertising media simultaneously, so there may be correlation between media. Therefore, it would be good to consider a multivariate regression model such as SUR(Seemingly Unrelated Regression), which considers correlation between residuals, or GLMM(Generalized Linear Mixed Model), which considers the hierarchical structure of the data, in subsequent studies.

References

[1] Agresti, A. (2012). Categorical Data Analysis 3rd ed. Wiley.

[2] Biewen, E., Nolte, S. and Rosemann, M. (2008). Multiplicative Measurement Error and the Simulation Extrapolation Method. IAW Discussion Papers 39.

[3] Boyd, S. and Vandenberghe, L. (2004). Convex Optimization. Cambridge University Press.

[4] Czado, C., Gneiting, T. and Held, L. (2009). Predictive Model Assessment for Count Data. Biometrics 65, 1254-1261.

[5] Greene, W. H. (2020). Econometric Analysis 8th ed. Pearson.

[6] Greenshtein, E. and Ritov, Y. (2004). Persistence in high-dimensional linear predictor selection and the virtue of overparametrization. Bernoulli 10(6), 971-988.

[7] Hayashi, F. (2000). Econometrics. Princeton University Press.

[8] Helske, J. (2016). Exponential Family State Space Models in R. arXiv preprint arXiv:1612.01907v2.

[9] Hyndman, R. J. and Athanasopoulos, G. (2021). Forecasting: principles and practice 3rd ed. OTexts. OTexts.com/fpp3.

[10] KOBACO. (2022). Broadcast Advertising Survey Report, 165-168.

[11] Liboschik, T., Fokianos, K. and Fried, R. (2017). An R Package for Analysis of Count Time Series Following Generalized Linear Models. Journal of Statistical Software 82(5), 1-51.

[12] Liu, J. S. (2001). Monte Carlo Strategies in Scientific Computing. Springer.

[13] Montgomery, D. C., Peck, E. A. and Vining, G. G. (2021). Introduction to Linear Regression Analysis 6th ed. Wiley.

[14] Shmueli, G. (2010). To Explain or to Predict?. Statistical Science 25(3), 289-310.

[15] Shumway, R. H. and Stoffer, D. S. (2016). Time Series Analysis and Its Applications with R Examples 4th ed. Springer.

Price Premium Discovery In Real Estate Auction Market: Decomposition Of The Korea Auction Sale Rate

Published

2023-09-22 12:00

Bohyun Yoo*

* Swiss Institute of Artificial Intelligence, Chaltenbodenstrasse 26, 8834 Schindellegi, Schwyz, Switzerland

Abstract

This study discovers and analyzes price premium (discount/surcharge) factors in the real estate auction market. Unlike existing bottom-up studies based on individual auction cases, a top-down time-series analysis is conducted, assuming that the price premium factor varies over time. To overcome limitations such as the difference between the court appraisal time* and the auctioned time, and the difficulty of using external data on court appraisals and price premium factors, the Fourier transform is utilized to extract the court appraisals and price premium factors in reverse. The extracted components are verified to determine if they can play a role as each factor. The price premium factor is found to have a similar movement to the difference in past values of the auction sale rate, and, as it signifies the discounts/surcharges in the auction market compared to the general market, it is named the “momentum factor”. Furthermore, by leveraging the momentum factor, the price premium can be differentiated by region, and the extent of the price premium applied can be distinguished over various time periods compared to the general market. Given the clustering tendency, the momentum factor can be a significant indicator for auction market participants to detect market changes.


1. Introduction

The housing auction market in Korea is one of the country's real estate markets, and many stakeholders such as mortgage banks, arbitrage investors, and non-performing loan operators are deeply involved. In general, there is a perception that the auction market is surcharged or discounted compared to the general market. If the auction market were an efficient and fair trading market, its prices would not differ from general market prices, but most housing auctions result from default, so the properties are known to carry legal issues, which act as a discount factor. However, the bottom-up analysis based on individual auction cases, which is the method mainly used in previous studies on discounts and surcharges, is limited in time and space and cannot consider time-varying effects, and its results are limited by and dependent on the data held by the researcher.

To overcome these limitations, the analysis should be carried out from the market perspective. However, the time series data, the Auction Sale Rate, is unreliable as an indicator because the court appraisal price, which serves as the standard, is set in the past rather than at the time of the auctioned price. It is difficult to specify the time of court appraisal as a variable in the model because how far in the past it lies relative to the successful bid varies from case to case, and even if the time is known, the court appraisal price cannot be accurately estimated. Individual cases can be investigated in a bottom-up manner to restate the appraisal at the general market price, but this is a vast task and, likewise, a study limited in time and space.

The target of this paper is the apartment auction market. To overcome the limitations of the auction sale rate, it is decomposed into three components in a top-down manner using the Fourier transform, and each decomposed component is verified. The price premium effect in the auction market is then estimated, its cause is analyzed, and the periods in which the price premium effect acts are identified. In addition, a time-varying beta estimated through the Kalman filter is used to support the price premium effect, and how the price premium effect differs across regional markets is also analyzed.

2. Literature review

Shilling et al (1990) analyzed apartment auctions in 1985 in Baton Rouge, Louisiana, USA, and found an auction discount of 24%. Forgey et al (1994) analyzed houses from 1991 to 1993 in the United States and found that they were traded at a 23% discount. Spring (1996) analyzed foreclosures in Texas from 1991 to 1993 and found a 4-6% discount. Clauretie and Daneshvary (2009) analyzed housing auctions from 2004 to 2007 and found that foreclosures were discounted by about 7.5% once endogeneity and autocorrelation were accounted for.

Campbell et al (2011) analyzed about 1.8 million housing transactions in Massachusetts and found that the discount rates for foreclosures and deaths were different. Zhou et al (2015) found that houses in 16 cities in the United States were discounted by 14.7% on average. Arslan, Guler & Tasking (2015) found that a 1% increase in risk-free interest rates led to a 27% drop in house prices and a 3% increase in foreclosure rates. Jin (2010) compared the general sale price and the auction price of apartments in Dobong-gu, Seoul and Suji-gu, Yongin-si, Korea, and found that the auction price is discounted relative to the general transaction price. Lee (2012) noted that the real estate market is not efficient and that the discount/surcharge phenomenon in the apartment auction market is one of its anomalies.

Lee (2009) and Oh (2021) pointed out the limitations that occurred when the court appraisal price and the auctioned price were different and estimated the auction sale rate by correcting the court appraisal price to the auctioned time.

However, previous studies mainly analyze variables in a bottom-up manner and are limited in space and time because they are based on individual auction cases. In addition, their results are difficult to apply to the Korean environment because countries other than Korea adopt an open bidding system.

3. Materials and method

3.1. Decomposition of auction sale rate

The auction sale rate is defined as follows.

\begin{equation} \label{eq:auction-sale-rate}
Auction\ Sale\ Rate\ _t=\frac{\sum_{i}\ Auctioned\ Price_{it}}{\sum_{i}\ Appraisal\ Price_{it-n}}\
\end{equation}

\begin{equation} \label{eq:auction-price}
Auctioned\ Price_t=\ Market\ Price_t\pm\ Price\ Premium_t\ (=discount\ or\ surcharge)
\end{equation}

\begin{equation} \label{eq:auction-sale-rate-price}
Auction\ Sale\ Rate\ _t=\frac{\sum_{i}\ (Market\ Price_t\ \pm\ Premium\ _t)}{\sum_{i}\ Appraisal\ Price_{t-n}}
\end{equation}

\begin{equation} \label{eq:market-price}
\text{If}\ Price\ Premium_t=0\ ,\ \ Market\ Price_t=Auctioned\ Price_t
\end{equation}

where $i$ indexes each auction case and $t$ indexes each month. If the auctioned price is discounted or surcharged compared to the general market price, the components can be separated as shown in (2), and if there is no discount or surcharge, it can be expressed as shown in (4). To estimate the price premium effect, which is the discount or surcharge, the auction sale rate can be written in regression form as shown in (5), and the explanatory power of the components is assumed to be ordered as shown in (6).

In the Regression form in terms of effects,

\begin{equation} \label{eq:auction-sale-rate-in-regression}
Auction\ Sale\ Rate_t=\beta_0+\beta_1EoM_t+\beta_2EoA_t+\beta_3EoP_t+\epsilon_t
\end{equation}

\begin{equation} \label{eq:explanatory-power}
\begin{split}
& \text{Explanatory Power of Each Component:} \\
& EoM\ \text{(Effect of Market Price)} > EoA\ \text{(Effect of Appraisal Price)} > EoP\ \text{(Effect of Price Premium)}
\end{split}
\end{equation}

3.2. The data

The empirical analysis in this paper is based on the monthly nationwide Auction Sale Rate and Market Price Index from 2012.03 to 2022.10. The auction sale rate is calculated from the sums of court appraisal prices and auctioned prices nationwide announced by the court over that period. The Market Price Index is an index of general market apartment prices nationwide and is provided by the Korea Real Estate Board. Log-differencing is applied to the Market Price Index so that both series have the same form, and both series are then standardized to mean 0 and variance 1 so that they share the same scale.
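The preprocessing described above can be sketched as follows. The numbers are made-up monthly values used only for illustration, and the function names are assumptions for the example; the sketch shows log-differencing of the index followed by standardization of both series to mean 0 and variance 1.

```python
import numpy as np


def log_difference(x):
    """log(x_t) - log(x_{t-1}); the result is one element shorter than x."""
    return np.diff(np.log(x))


def standardize(x):
    """Rescale to mean 0 and variance 1."""
    return (x - x.mean()) / x.std()


# Hypothetical monthly values, not the study's data.
market_index = np.array([100.0, 101.2, 102.0, 101.5, 103.1, 104.8])
auction_rate = np.array([0.82, 0.85, 0.84, 0.88, 0.90, 0.91])

mkt = standardize(log_difference(market_index))
# Drop the first month so both series cover the same dates.
rate = standardize(auction_rate[1:])
print(mkt, rate)
```

Differencing removes the first observation, so the auction sale rate is trimmed by one month in this sketch to keep the two series aligned.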

Figure 1. Auction Sale Rate and Market Price Index

Figure 2. Comparison of Standardized Auction Sale Rate and Market Price Index (Log-differencing)

The skewness and kurtosis reported in Table 1 show that AuctionSaleRate and MarketPriceIndex have different peaks and tails compared to the normal distribution, and the Lev results in Table 1 show a pattern that differs from the leverage effect (Black 1976) of the stock market. In both the auction market and the general sales market, price has a positive relationship with future volatility. This means that volatility in the real estate market has a positive correlation with price.

3.3. Identification of variables

3.3.1. The effect of market price

The auction sale rate can be decomposed into three components in the regression form shown in (5), and the log-differenced market price index is used as the proxy variable for the first component, EoM. As shown in Table 2, EoM has the strongest explanatory power for the auction sale rate.

3.3.2. Component identification

\begin{equation} \label{eq:component-identification}
y_t=\beta_0+\beta_1Mkt_t+\epsilon_t
\end{equation}

where $y_t$ is the auction sale rate at time $t$, $\beta_0$ is the intercept, $\beta_1$ is the parameter of $Mkt$, and $Mkt$ is the log-differenced Market Price Index. As defined in (5), the remaining EoA and EoP components are latent in the residual. To identify the EoA and EoP components, a Fourier transform is applied to $\epsilon_t$ in (7), and the two highest-amplitude signals are extracted, assuming that they are the court appraisal and price premium effects as defined in (6).

3.3.2.1. Fourier transform

The Fourier transform is a mathematical transformation that decomposes a function into its frequency components, representing the output in the frequency domain. In this paper, it is used to extract the orthogonal cycles of EoA and EoP as defined in (5). In terms of linear transformation, the orthogonal factors present in the signal can be extracted with the forward and inverse Discrete Fourier matrix, as shown in (8) and (9).

\begin{equation} \label{eq:fft}
X=F_{N}x \ \text{and} \ x=F_N^{-1}X=\frac{1}{N}\overline{F_N}X\ \text{<Forward and Inverse>}
\end{equation}

\begin{equation} \label{eq:fft-in-matrix}
\begin{split}
& {\underbrace{\left[\begin{matrix}
X\left[0\right] \\
X\left[1\right] \\
\vdots \\
X\left[N-1\right] \\
\end{matrix}\right]}}_{Signal} \
= \
{\underbrace{\left[\begin{matrix}
W_N^{0\cdot0} & W_N^{0\cdot1} & \cdots & W_N^{0\cdot(N-1)} \\
W_N^{1\cdot0} & W_N^{1\cdot1} & \cdots & W_N^{1\cdot(N-1)} \\
\vdots & \vdots & \ddots & \vdots \\
W_N^{(N-1)\cdot0} & W_N^{(N-1)\cdot1} & \cdots & W_N^{(N-1)\cdot(N-1)} \\
\end{matrix}\right]}}_\text{$F_N$ (Discrete Fourier Matrix)}
{\underbrace{\left[\begin{matrix}
x\left[0\right] \\
x\left[1\right] \\
\vdots \\
x\left[N-1\right] \\
\end{matrix}\right]}}_\text{Residual ($\epsilon_t$)} \\
& \text{where } W_N^{n\cdot k}=\exp{\left(-j\frac{2\pi k}{N}n\right)}
\end{split}
\end{equation}

\begin{equation} \label{eq:signal-k}
X\left[k\right]=x\left[0\right]W_N^{k\cdot0}+x\left[1\right]W_N^{k\cdot1}+\ldots+\ x\left[N-1\right]W_N^{k\cdot\left(N-1\right)} , \ \text{where } k \text{ is the index of the signal}
\end{equation}

where $x=\left(x_0,x_1,\ldots,x_{N-1}\right)^T$ is the vector of $\epsilon$ in (7), $N$ is the length of the vector, $X=\left(X_0,X_1,\ldots,X_{N-1}\right)^T$ is the signal, and $F_N$ is the Discrete Fourier Matrix. As shown in (9) and (10), cyclic time series data can be decomposed into orthogonal signals by the Discrete Fourier Transform as a linear transformation. In practice, however, the $O(N^2)$ DFT calculation is replaced by the Fast Fourier Transform (Cooley-Tukey algorithm, 1965), which speeds up the calculation to $O(N\log N)$ by dividing the DFT into even and odd terms (11). Figure 3 shows that two high-amplitude signals were extracted by performing the FFT on the residual in (7).

\begin{equation} \label{eq:n-log-n}
\begin{split}
X\left[ k \right] & = \sum_{n=0}^{N-1} x_n \exp \left( -j \frac{2 \pi k}{N} n \right) \\
& = \sum_{m=0}^{N/2-1}x_{2m}\exp{\left(-j\frac{2\pi k}{N}2m\right)}+\ \sum_{m=0}^{N/2-1}x_{2m+1}\exp{\left(-j\frac{2\pi k}{N}(2m+1)\right)} \\
& = \sum_{m=0}^{N/2-1}x_{2m}\exp{\left(-j\frac{2\pi k}{N\ /\ 2}\ m\ \right)}+\exp{\left(-j\frac{2\pi k}{N}\ \right)}\sum_{m=0}^{N/2-1}x_{2m+1}\exp{\left(-j\frac{2\pi k}{N/2}m\right)}
\end{split}
\end{equation}

where $x_{2m}=(x_0,x_2,\ldots,x_{N-2})$ is the even-indexed part and $x_{2m+1}=(x_1,x_3,\ldots,x_{N-1})$ is the odd-indexed part.
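The relations in (8) to (11) can be checked numerically with a short sketch. The input vector is random and only stands in for the residual series, and the helper name `dft_matrix` is an assumption for the example. The sketch builds the Discrete Fourier Matrix, compares the matrix product with NumPy's FFT, recovers the input through the inverse, and verifies one even/odd split step.

```python
import numpy as np


def dft_matrix(N):
    """F_N with entries W_N^{k*n} = exp(-2j*pi*k*n/N)."""
    k = np.arange(N)
    return np.exp(-2j * np.pi * np.outer(k, k) / N)


rng = np.random.default_rng(0)
N = 8
x = rng.normal(size=N)  # stand-in for the residual series

F = dft_matrix(N)
X = F @ x                      # forward transform, O(N^2)
x_back = (F.conj().T @ X) / N  # inverse via the conjugate transpose

# One even/odd split step: X[k] = E[k mod N/2] + exp(-2j*pi*k/N) * O[k mod N/2]
E = dft_matrix(N // 2) @ x[0::2]
O = dft_matrix(N // 2) @ x[1::2]
k = np.arange(N)
X_split = E[k % (N // 2)] + np.exp(-2j * np.pi * k / N) * O[k % (N // 2)]

print(np.allclose(X, np.fft.fft(x)), np.allclose(X, X_split), np.allclose(x_back, x))
```

Applying the same split recursively to the two half-length transforms is what reduces the cost from $O(N^2)$ to $O(N\log N)$.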

Figure 3-1. Transformed to Frequency Domain and Filtered by Amplitude

Figure 3-2. Transform Residual in (7) to FFT and extract signals

3.3.2.2. Regression analysis

\begin{equation} \label{eq:stage-2}
y_t=\beta_0+\beta_1Mkt_t+\beta_2SIG1_t+\mu_t
\end{equation}

\begin{equation} \label{eq:stage-3}
y_t=\beta_0+\beta_1Mkt_t+\beta_2SIG1_t+\beta_3\widehat{SIG2}_t+\omega_t
\end{equation}

\begin{equation} \label{eq:signal-2}
\widehat{SIG2}_t=\begin{cases} 1, & \sigma\left(SIG2_t\right) > 0.5 \\ 0, & \sigma\left(SIG2_t\right) < 0.5 \end{cases} , \ \sigma\left(x\right)=\frac{1}{1+e^{-x}}
\end{equation}

where $SIG1$ is the highest-amplitude signal in the residual $\epsilon_t$ of (7) and $SIG2$ is the highest-amplitude signal in the residual $\mu_t$ of (12).

Table 2 shows the results of using the signals extracted by the FFT in 3.3.2.1 as regression variables. $SIG2$ is a component of EoP, and to distinguish price premium effects clearly, it is transformed into categorical data (0/1) through the sigmoid function as shown in (14). The Difference results in Table 2 show that the parameters hardly change, demonstrating that the two signals found are almost orthogonal components and do not cause omitted variable bias (Wooldridge, 2009). The adj. R-squared supports the order of explanatory power assumed in (6). Lastly, the residual ACF/PACF plot in Figure 4 indicates that no further patterns exist in the residuals of (13) after the three components are removed. This supports the assumption outlined in 3.1 (5) that the auction sale rate is composed of three main components.
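The extraction step can be illustrated with a small sketch. It is not the paper's code: the residual is a synthetic mix of two cycles plus noise, the function names are hypothetical, and subtracting the first signal is used here as a simplified stand-in for re-estimating regression (12) before extracting the second signal. The sketch keeps the highest-amplitude frequency, inverts the FFT, and then applies the sigmoid threshold of (14).

```python
import numpy as np


def dominant_signal(resid):
    """Keep only the highest-amplitude nonzero frequency and invert the FFT."""
    spectrum = np.fft.rfft(resid)
    amplitude = np.abs(spectrum)
    amplitude[0] = 0.0  # ignore the mean (zero-frequency) term
    keep = np.argmax(amplitude)
    filtered = np.zeros_like(spectrum)
    filtered[keep] = spectrum[keep]
    return np.fft.irfft(filtered, n=len(resid))


def binarize(sig):
    """Sigmoid followed by a 0.5 threshold, giving a 0/1 indicator."""
    prob = 1.0 / (1.0 + np.exp(-sig))
    return (prob > 0.5).astype(int)


# Hypothetical residual: two cycles plus noise.
rng = np.random.default_rng(0)
t = np.arange(128)
cycle_a = 2.0 * np.sin(2 * np.pi * 4 * t / 128)
cycle_b = 0.8 * np.sin(2 * np.pi * 10 * t / 128)
resid = cycle_a + cycle_b + rng.normal(0.0, 0.1, size=128)

sig1 = dominant_signal(resid)         # highest-amplitude cycle
sig2 = dominant_signal(resid - sig1)  # next cycle, from what remains
sig2_hat = binarize(sig2)
print(sig2_hat[:16])
```

Because the sigmoid equals 0.5 at zero, the thresholding amounts to marking the periods in which the second signal is positive.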

Figure 4. ACF/PACF Plot of Residual $\omega_t$ (13)

3.3.3. Proof of the effect of appraisal price

Based on Table 2 and according to the assumption of (5), $SIG1$ is EoA (Effect of Appraisal Price in Auction Sale Rate). The court appraisal time is in the past relative to the auctioned time (1). The difference between the two points makes it difficult to define the court appraisal effect variable in terms of time series analysis. Since correcting the price difference arising from this time gap for all auction cases is very difficult, the Fourier transform (3.3.2.1) is used in this paper. To prove that $SIG1$ is EoA, 2,762 individual auction cases that occurred between 2016.04 and 2018.03 in Seoul and Busan are empirically analyzed (Table 3, Table 4).

Figure 5. The difference of time between Court Appraisal time and Auctioned time

The analysis is conducted in two main aspects:

1. Time interval between the court appraisal time and the auctioned time (Table 4)
2. Regression on the general market price at the time of the court appraisal (Table 4)

\begin{equation} \label{eq:cp}
CP_t=\ \alpha_0dummy_t+\alpha_1MP_t+\gamma_t
\end{equation}

where $CP_t$ is the price at the time of court appraisal (Figure 5), $MP_t$ is the housing price, $\alpha_0$ is the coefficient of the dummy variable, and $\alpha_1$ is the parameter of the housing price.

Figure 6. Residual Distribution in (15) & The difference between Court Appraisal and Auctioned time (days)

As shown in Table 4, the distribution of the time difference is right-skewed, and its 25% to 75% range is about 7 to 11 months. The price difference has a long-tailed distribution, and it can be estimated that the court appraisal price and the housing price at the time of the court appraisal are very highly correlated and almost the same value. To summarize the two analyses, the court appraisal price is a lag variable of the housing price. In terms of the components in (5), EoA can be assumed to have a lag relationship with $Mkt$, and the results are shown in Table 5.

Table 5. Regression of analysis ($SIG1$ vs $Mkt$)

Table 5 [1] shows the relationship between the lag variable of $SIG1$ and $Mkt$. The lag variable of $SIG1$, rather than $SIG1$ itself, is compared with $Mkt$ because $SIG1$ extracted by the Fourier transform is a signal indicating the influence of the past at the present time rather than the past price itself. The lag orders compared are set from 7 to 11 months, which is the 25% to 75% range in Table 4. The analysis confirms that the lag variable of $SIG1$ has a significant relationship with $Mkt$.

Table 5 [2] checks whether $Mkt$'s lag variable can replace the court appraisal price, given that the court appraisal price has a time lag relationship with $Mkt$ according to the results of Table 4. The analysis shows a significant relationship.

Table 5 [3] confirms the relationship between $SIG1$ and the auction sale rate. If the court appraisal price could be replaced by $Mkt$'s lag variable alone, as in Table 5 [2], the $SIG1$ variable would not be meaningful, but the results show that Table 5 [3] is superior to Table 5 [2]. The reason is that, as in Figure 6, auction cases with no special depreciation factor can be explained by $Mkt$'s lag, but there is an unidentified area with a large gap from $Mkt$, such as cases with legal issues, equity auctions, or a time difference outside the 25% to 75% range.

Figure 7. The lag of $Mkt$ can represent only the identified part

To sum up the results of Table 4 and Table 5, $SIG1$ has a lag relationship with $Mkt$ and is superior to the lag variables of $Mkt$ given the limits shown in Figure 7. Therefore, $SIG1$ can be presumed to be EoA, as assumed in (5).

3.3.4. Proof of the effect of premium price

Based on the results of Table 2 and according to the assumption of (5), $SIG2$ is EoP (Effect of Price Premium in Auction Sale Rate). For the analysis, $SIG2$ is transformed into a categorical value through the sigmoid function to represent the price premium being on or off, as in 3.3.2.2. In this paper, two aspects support that $SIG2$ is EoP.

1. Verify that $\widehat{SIG2}$ can distinguish between discount and surcharge points (Figure 8).
2. Track which variable $SIG2$ corresponds to, name it, and verify that it makes sense.

3.3.4.1. Distinguishing the price premium effect in the auction sale rate

The $\widehat{SIG2}$ parameter of Table 2 [3] is about 0.49 with a positive sign. Figure 8 is based on the baseline predicted by Table 2 [2], and the auction sale rate points are clearly separated above and below it by the 1/0 values of $\widehat{SIG2}$ in Table 2 [3]. The right-hand side of Figure 8 shows that the two groups have distributions with different means and variances. Therefore, $SIG2$ can be presumed to be EoP, as assumed in (5).

Figure 8. Surcharge and discount points that can be distinguished by $\widehat{SIG2}$

3.3.4.2. Momentum factor

In 3.3.4.1, it is confirmed that $SIG2$ is a component that can explain the price premium effect, but it is meaningless if it cannot be explained by any variable in practice. This paper identifies which variable $SIG2$ can be compared with, verifies whether it makes sense, and finally names it. First, $SIG2$ is likely to be a variable of the auction market itself because EoM and EoP likely already contain most macroeconomic effects. In fact, no significant correlation was found between $SIG2$ and comparable macroeconomic variables. According to the Lev result of Table 1, the future volatility of the auction market has a positive correlation with the auction sale rate, and the EoP component also has a positive correlation according to Table 2 [3]. So, the variable that can be compared as a component of the auction market itself is volatility (16)(17). The results of the verification of this hypothesis are shown in Table 6.

\begin{equation} \label{eq:signal-2-2}
SIG2_t=\ c_0+c_1{v1}_t+c_2{v2}_t+\eta_t, \ {v1}_t = \left(y_t-y_{t-1}\right)_t , \ {v2}_t=\ \left(y_{t-1}-y_{t-2}\right)_t
\end{equation}

\begin{equation} \label{eq:signal-2-3}
SIG2_t=\ c_0+c_1\left(y_t-y_{t-1}\right)_t+c_2\left(y_{t-1}-y_{t-2}\right)_t+\eta_t
\end{equation}

where $c_0$ is the intercept, $y$ is the auction sale rate, and $v$ is volatility, measured as the first difference of the auction sale rate.
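As a rough illustration of how (16) could be estimated, the sketch below builds the two lagged differences and fits the coefficients by ordinary least squares. The function names and the synthetic series are hypothetical and only show how $v1_t$ and $v2_t$ are aligned with $SIG2_t$; they are not the paper's data or estimation code.

```python
import numpy as np

def volatility_regressors(y):
    """Return v1_t = y_t - y_{t-1} and v2_t = y_{t-1} - y_{t-2} for t = 2..T-1."""
    y = np.asarray(y, dtype=float)
    v1 = y[2:] - y[1:-1]
    v2 = y[1:-1] - y[:-2]
    return v1, v2

def fit_sig2_on_volatility(sig2, y):
    """OLS estimate of (c0, c1, c2) in SIG2_t = c0 + c1*v1_t + c2*v2_t + eta_t."""
    v1, v2 = volatility_regressors(y)
    X = np.column_stack([np.ones_like(v1), v1, v2])
    target = np.asarray(sig2, dtype=float)[2:]  # drop the first two periods to align with v1, v2
    coef, *_ = np.linalg.lstsq(X, target, rcond=None)
    return coef

# Synthetic illustration only (assumed values, not the paper's data).
rng = np.random.default_rng(0)
y = np.cumsum(rng.normal(size=200))              # stand-in for the auction sale rate
v1, v2 = volatility_regressors(y)
sig2 = np.concatenate([[0.0, 0.0], 0.5 + 0.3 * v1 - 0.2 * v2])
coef = fit_sig2_on_volatility(sig2, y)           # recovers (c0, c1, c2) used above
```

Because the synthetic signal is noise-free, the fitted coefficients match the values used to generate it; with real data the residual $\eta_t$ would of course be nonzero.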

Figure 9. Comparison between $SIG2$ and $\widehat{C_t^T} V_{t}$ (16)

Figure 10. Surcharge and discount points that can be distinguished by $\sigma(\widehat{C^T} V_t)$

In Table 6, the volatility variable is significantly related to $SIG2$, and in Figure 9, the value described by the volatility variable (16)(17) and $SIG2$ show similar movements. Figure 10 shows that the volatility variable distinguishes the surcharge and discount points well and that the two groups have different distributions, as in Figure 8.

In summary, the volatility of the auction sale rate can be regarded as the main factor behind the price premium effect. Volatility produces this effect because the volatility of the auction market is positively correlated with the auction sale rate. Accordingly, the volatility component can be named the momentum of the auction market.

3.3.5. Time varying beta to capture price premium section

In 4.3.4, it was confirmed that $SIG2$, extracted through the Fourier transform, represents a price premium effect, and it was verified to be a momentum factor. However, the analysis period of this paper is about 10 years, so it is more reasonable to assume that the parameter linking the market and the price premium variable varies over time than to assume it is a fixed constant. In other words, the $\beta$s in (18) are not stable over time. The sensitivity of beta can be used to capture the intervals in which momentum is at work in the market, beyond simply distinguishing the price premium effect. In this paper, a Kalman filter is used to estimate the time-varying betas.

\begin{equation} \label{eq:betas-not-stable}
y_t=\beta_0+\beta_1 Mkt_t+\beta_2 SIG1_t+\beta_3 \widehat{SIG2}_t+\epsilon_t , \quad \epsilon_t \sim N(0,\sigma^2)
\end{equation}

3.3.5.1. Kalman filter

The Kalman filter is a model for describing dynamics based on measurements and a recursive procedure for computing the estimator of the unobserved component, or state vector, at time $t$.

\begin{equation} \label{eq:state-model}
\xi_t=F_t\xi_{t-1}+q_t , \quad q_t \sim N(0,Q) \quad \text{<State Model>}
\end{equation}

\begin{equation} \label{eq:observation-model}
y_t=H_t\xi_t+r_t , \quad r_t \sim N(0,R) \quad \text{<Observation Model>}
\end{equation}

Calculate the optimal parameter of $\xi_{t|t-1}$, based on available information up to time $t-1$,

\begin{equation} \label{eq:xi-hat}
\hat{\xi}_{t|t-1}=F_t\hat{\xi}_{t-1|t-1}
\end{equation}

\begin{equation} \label{eq:covariance-xi}
P_{t|t-1}=F_tP_{t-1|t-1}F_t^T+Q
\end{equation}

\begin{equation} \label{eq:state-matrix}
F_{t|t-1}=H_tP_{t|t-1}H_t^T+R
\end{equation}

Calculate the optimal parameter of $\xi_{t|t}$, based on available information up to time $t$,

\begin{equation} \label{eq:kalman-gain}
K_t=P_{t|t-1}H_t^T\left(F_{t|t-1}^T\right)^{-1}
\end{equation}

\begin{equation} \label{eq:covariance-at-time-t}
P_{t|t}=\left(I-K_tH_t\right)P_{t|t-1}
\end{equation}

\begin{equation} \label{eq:xi-at-time-t}
\hat{\xi}_{t|t}=\hat{\xi}_{t|t-1}+K_t\,r_{t|t-1} , \quad r_{t|t-1}=y_t-H_t\hat{\xi}_{t|t-1}
\end{equation}

A random walk is assumed for the coefficients: $Q$ and $R$ are initialized near 0 (= diffuse prior) and $F$ is the identity matrix diag(1,1,1,1). The Kalman gain ($K$) determines the weight given to new information, using the error between the prediction and the observation.
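The recursion above can be sketched for the time-varying regression (18) as follows. This is a minimal illustration, assuming a random-walk state ($F = I$), a scalar observation noise variance, and the conventional innovation $y_t - H_t\hat{\xi}_{t|t-1}$, so that the update adds $K_t$ times the innovation. The function name `kalman_tv_beta` and the values of `q`, `r`, and `p0` are assumptions for illustration, not the settings used in the paper.

```python
import numpy as np

def kalman_tv_beta(y, X, q=1e-4, r=1.0, p0=1e3):
    """Filter beta_t in y_t = X_t beta_t + r_t, with beta_t = beta_{t-1} + q_t.

    q, r and p0 are illustrative values (assumptions), not the paper's settings.
    """
    y = np.asarray(y, dtype=float)
    X = np.asarray(X, dtype=float)
    T, k = X.shape
    F = np.eye(k)                    # random-walk transition matrix
    Q = q * np.eye(k)                # state noise covariance
    xi = np.zeros(k)                 # filtered state estimate
    P = p0 * np.eye(k)               # large initial covariance (vague prior)
    betas = np.zeros((T, k))
    for t in range(T):
        H = X[t:t + 1]               # 1 x k observation matrix at time t
        # Prediction step, using information up to t-1
        xi_pred = F @ xi
        P_pred = F @ P @ F.T + Q
        # Innovation and its variance
        innov = y[t] - (H @ xi_pred)[0]
        S = (H @ P_pred @ H.T)[0, 0] + r
        # Update step, using information up to t
        K = (P_pred @ H.T)[:, 0] / S
        xi = xi_pred + K * innov
        P = (np.eye(k) - np.outer(K, H[0])) @ P_pred
        betas[t] = xi
    return betas

# Synthetic check with constant true betas (assumed values, not the paper's data).
rng = np.random.default_rng(1)
T = 300
X = np.column_stack([np.ones(T), rng.normal(size=T)])
y = X @ np.array([1.0, 2.0]) + 0.1 * rng.normal(size=T)
betas = kalman_tv_beta(y, X)
```

Each row of `betas` holds the filtered coefficients at one time step, so plotting a column against time gives a path comparable in spirit to the Kalman-filter betas in Figure 11.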

Figure 11. Beta (OLS) vs Beta (Kalman Filter) & Beta ($Mkt$) vs Beta ($\widehat{SIG2}$)

Figure 12. The Sensitivity points of EoP to the Auction Market

Table 8 shows that the time-varying betas estimated with the Kalman filter perform better than OLS with constant parameters. Figure 11 compares the changes in the parameter of $\widehat{SIG2}$ with the changes in the parameter of $Mkt$ over the same period. In Figure 12, a point is set to 1 and plotted if the parameter of $\widehat{SIG2}$ exceeds the upper confidence bound of the OLS estimate. The region in Figure 11 where the beta of $\widehat{SIG2}$ exceeds the beta of $Mkt$ coincides with the region marked 1 in Figure 12, indicating that in these periods the auction market is more sensitive to the price premium effect than to the market price effect. Such a period can be regarded as a momentum interval, in which the price premium effect is especially sensitive.

3.3.5.2. Experiment

It is necessary to confirm whether the logic constructed so far also holds in regional auction markets, not only nationwide. Furthermore, when the model is estimated by region, the characteristics of each region can be identified. The target areas of the empirical analysis are Seoul and Gyeonggi, where the auction market is most active.

Figure 13. (Seoul) $Mkt$ vs Auction Sale rate in Seoul (Left) Distinguished auction sale rate by EoP (Right)

Figure 14. (Seoul) Beta (OLS) vs Beta (Kalman Filter) & Beta ($Mkt$) vs Beta ($\widehat{SIG2}$)

Figure 15. (Seoul) The Sensitivity points of EoP to the Seoul Auction Market

Figure 16. (Gyeong-gi) $Mkt$ vs Auction Sale rate (Left) Distinguished auction sale rate by EoP (Right)

Figure 17. (Gyeong-gi) Beta (OLS) vs Beta (Kalman Filter) & Beta ($Mkt$) vs Beta ($\widehat{SIG2}$)

Figure 18. (Gyeong-gi) The Sensitivity points of EoP to the Gyeong-gi Auction Market

Table 8 and Figures 13 to 18 present the results of the analysis for Seoul and Gyeonggi Province. The beta of $SIG2$ in Table 8 [2] shows that Seoul is more sensitive than Gyeonggi-do in terms of the price premium, and Figures 13-15 illustrate these results well. In particular, Seoul's beta of EoP has far exceeded the beta of $Mkt$ since early 2020, supporting the general perception that overheating sentiment is forming in the Seoul apartment auction market. In contrast, the effect of EoP is relatively low in Gyeonggi-do. In addition, these results make it possible to tell whether the outlier points in each region's auction sale rate are due to the influence of EoP.

4. Conclusion

Previous auction market studies using the bottom-up method mainly analyzed the variables affecting the auction sale rate, and were limited in space and time to the data they had. In this paper, time series analysis was carried out from a market perspective, and a top-down method using the Fourier transform was applied to address the problem that the court appraisal price cannot reflect the general market price at the time of the auction. The price premium effect was then identified by verifying each component.

In addition, the price premium effect in the auction market was found to arise from the momentum effect, and the time-varying beta (Kalman filter) supports this logic by showing that the price premium effect differs by region. It is practically impossible to analyze the vast number of individual auction cases in order to analyze the auction market, and this paper is encouraging in that it provides auction market participants with indicators that can be viewed from a market perspective.

However, interpreting these indicators requires a deep understanding of the momentum factor. Sensitive movement of the momentum factor signifies not just market rises or falls; it indicates shifts in the price relationship between the auction and general markets. Intuitively, when the real estate market heats up, high demand narrows the gap between general market prices and auction prices.

Therefore, the role of the momentum factor can be interpreted as representing the 'popularity' of the auction market compared to the general market. To elaborate further, it can serve as an indicator to judge whether the market is overheating or cooling down in comparison to the general market.

An additional insight of this study is that, once market prices are controlled for through the court appraisal price, the only remaining factor in Korea's apartment auction market is momentum. Macro factors such as government regulations and interest rates are already reflected in the market price, so the third variable of the auction market is only the momentum factor, which can be very important information for auction market participants.

This paper could be made more rigorous if the following limitations were resolved. Since the monthly auction sale rate data may not be sufficient, a longer analysis period or additional data would strengthen the analysis. In addition, the analysis would be strengthened if more data on the unidentified area could be obtained in the process of verifying the court appraiser component.

References

[1] Arslan, Y., Guler, B. & Taskin, T. (2015). “Joint dynamic of house prices and foreclosures,” Journal of Money, Credit and Banking, 47(1), 133-169.

[3] Clauretie, T.M., Daneshvary, N.,(2009). “Estimating the house foreclosure discount corrected for spatial price interdependence and endogeneity of marketing time,” Real Estate Economics. 37 (1), 43-67.

[4] Campbell, J.Y., Giglio, S., Pathak, P.,(2011). “Forced sales and house prices,” American Economic Review. 101 (5), 2108-2131.

[5] Forgey, F.A., Rutherford, R.C., VanBuskirk, M.L.,(1994). “Effect of foreclosure status on residential selling price,” Journal of Real Estate Research. 9 (3), 313-318.

[6] Jin, (2010). Is the Selling Price Discounted at the Real Estate Auction Market? Housing Studies Review, 18(3), 93-117.

[7] Lee, (2009). True Auction Price Ratio for Condominium: The Case of Gangnam Area, Seoul, Korea. Housing Studies Review, 17(4), 233-258.

[8] Lee, (2012). Anomalies in Real Estate Markets: A Survey. Housing Studies Review, 20(3), 5-40.

[9] Mergner, S. (2009). Applications of State Space Models in Finance (pp. 17-40). Universitätsverlag Göttingen.

[10] Oh, (2021). A study on influencing factors for auction successful bid price rate of apartments in Seoul area Journal of the Korea Real Estate Management Review, 23, 99-119.

[11] Shilling, J.D., Benjamin, J.D., Sirmans, C.F.,(1990). “Estimating net realizable value for distressed real estate,” Journal of Real Estate Research. 5 (1), 129-140.

[12] Springer, T.M.,(1996). “Single-family housing transactions: seller motivations, price, and marketing time,” Journal of Real Estate Finance Economics. 13 (3), 237-254.

[13] Wooldridge, J. M. (2015). Introductory econometrics: A modern approach (pp. 83-91). Cengage Learning.

[14] Zhou, H., Yuan, Y., Lako, C., Sklarz, M., McKinney, C.,(2015). “Foreclosure discount: definition and dynamic patterns,” Real Estate Economics. 43 (3), 683-718.

[15] Zhou, Y., Cao, W., Liu, L., Agaian, S., & Chen, C. P. (2015). Fast Fourier transform using matrix decomposition. Information Sciences, 291, 172-183.

Interpretable Topic Analysis

Published

2023-09-22 12:00

Mincheol Kim*

* Swiss Institute of Artificial Intelligence, Chaltenbodenstrasse 26, 8834 Schindellegi, Schwyz, Switzerland

Abstract

User-generated data, often characterized by its brevity, informality, and noise, poses a significant challenge for conventional natural language processing techniques, including topic modeling. User-generated data encompasses informal chat conversations, Twitter posts laden with abbreviations and hashtags, and an excessive use of profanity and colloquialisms. Moreover, it often contains "noise" in the form of URLs, emojis, and other forms of pseudo-text that hinder traditional natural language processing techniques.

This study sets out to find a principled approach to objectively identifying and presenting improved topics in short, messy texts. Topics, the thematic underpinnings of textual content, are often "hidden" within the vast sea of user-generated data and remain "undiscovered" by statistical methods, such as topic modeling.

We explore innovative methods, building upon existing work, to unveil latent topics in user-generated content. The techniques under examination include Latent Dirichlet Allocation (LDA), Reconstructed LDA (RO-LDA), Gaussian Mixture Models (GMM) for distributed word representations, and Neural Probabilistic Topic Modeling (NPTM).

Our findings suggest that NPTM exhibits a notable capability to extract coherent topics from short and noisy textual data, surpassing the performance of LDA and RO-LDA. Conversely, GMM struggled to yield meaningful results. It is important to note that the results for NPTM are less conclusive due to its extended computational runtime, limiting the sample size for rigorous statistical testing.

This study addresses the task of objectively extracting meaningful topics from such data through a comparative analysis of novel approaches.

Also, this research contributes to the ongoing efforts to enhance topic modeling methodologies for challenging user-generated content, shedding light on promising directions for future investigations.
This study presents a comprehensive methodology employing Graphical Neural Topic Models (GNTM) for textual data analysis. "Group information" here refers to topic proportions (theta). We applied a Non-Linear Factor Analysis (FA) approach to extract this intricate structure from text data, similar to traditional FA methods for numerical data.

Our research showcases GNTM's effectiveness in uncovering hidden patterns within large text corpora, with attention to noise mitigation and computational efficiency. Optimizing topic numbers via AIC and agglomerative clustering reveals insights within reduced topic sub-networks.
Future research aims to bolster GNTM's noise handling and explore cross-domain applications, advancing textual data analysis.


1. Introduction

Over the past few years, the volume of news information on the Internet has seen exponential growth. With news consumption diversifying across various platforms beyond traditional media, topic modeling has emerged as a vital methodology for analyzing this ever-expanding pool of textual data. This introduction provides an overview of the field and its foundational work.

1.1 Seminal work: topic modeling research

One of the pioneering papers in news data analysis using topic modeling is "Latent Dirichlet Allocation", which introduced the LDA technique and revolutionized the extraction and analysis of topics from textual data.

The need for effective topic modeling in the context of the rapidly growing user-generated data landscape has been emphasized. The challenges posed by short, informal, and noisy text data, including news articles, are highlighted.

There are numerous advantages of employing topic modeling techniques for news data analysis, including:

Topic derivation for understanding frequent news coverage.
Trend analysis for tracking news trends over time.
Identifying correlations between news topics.
Automated information extraction and categorization.
Deriving valuable insights for decision-making.

Recent advancements in the fusion of neural networks with traditional topic modeling techniques have propelled the field forward. Papers such as "Neural Topic Modeling with Continuous Neighbors" have introduced innovative approaches that warrant exploration. By harnessing deep learning and neural networks, these approaches aim to enhance the accuracy and interpretability of topic modeling.

Despite the growing importance of topic modeling, existing topic modeling methods do not sufficiently consider the context between words, which can lead to difficult interpretation or inaccurate results. This limits the usability of topic modeling. The continuous expansion of text documents, especially news data, underscores the urgency of exploring its potential across various fields. Public institutions and enterprises are actively seeking innovative services based on their data.

To address the limitations of traditional topic modeling methods, this paper proposes the Graphical Neural Topic Model (GNTM). GNTM integrates graph-based neural networks to account for word dependencies and context, leading to more interpretable and accurate topics.

1.2 Research objectives

This study aims to achieve the following objectives:

Present a novel methodology for topic extraction from textual data using GNTM.
Explore the potential applications of GNTM in information retrieval, text summarization, and document classification.
Propose a topic clustering technique based on GNTM for grouping related documents.

In short, the primary objectives are to present GNTM's capabilities, to explore its applications in information retrieval, text summarization, and document classification, and to propose a topic clustering technique.

The subsequent sections of this thesis delve deeper into the methodology of GNTM, experimental results, and the potential applications in various domains. By the conclusion of this research, these contributions are expected to provide valuable insights into the efficient management and interpretation of voluminous document data in an ever-evolving information landscape.

2. Problem definition

2.1 Existing industry-specific keywords analysis

South Korea boasts one of the world's leading economies, yet its reliance on foreign demand surpasses that of domestic demand, rendering it intricately interconnected with global economic conditions[3]. This structural dependency implies that even a minor downturn in foreign economies could trigger a recession within Korea if the demand for imports from developed nations declines. In response, public organizations have been established to facilitate Korean company exports worldwide.

However, the efficacy of these services remains questionable, with South Korea's exports showing a persistent downward trajectory and a trade deficit anticipated for 2022. The central issue lies in the inefficient handling of global textual data, impeding interpretation and practical application.

Figure 1. Industry-specific keywords: Data service provided by public organization

Han, G.J. (2022) scrutinized the additional features and services available to paid members through the utilization of big data and AI capabilities based on domestic logistics data[5]: Trade and Investment Big Data (KOTRA), Korea Trade Statistics Information Portal (KTSI), GoBiz Korea (SME Venture Corporation), and K-STAT (Korea Trade Association).

Regrettably, these services predominantly offer basic frequency counts, falling short of delivering valuable insights. Furthermore, they are confined to providing internal and external statistics, rendering their output less practical. While BERT and GPT have emerged as potential solutions, these models excel in generating coherent sentences rather than identifying representative topics based on company and market data and quantifying the distribution of these topics.

2.2 Proposed model for textual data handling

To address the challenge of processing extensive textual data, we introduce a model with distinct characteristics:

Extraction of information from data collected within defined timeframes.
A model structure producing interpretable outcomes with traceable computational pathways.
Recommendations based on the extracted information.

Previous research mainly relied on basic statistics to understand text data. However, these methods have limitations, such as difficulty in determining important topics and handling large text sets, making it hard for businesses to make decisions.

Our research introduces a method for the precise extraction and interpretation of textual data meaning via a natural language processing model. Beyond topic extraction, the model will uncover interrelationships between topics, enhance text data handling efficiency, and furnish detailed topic-related insights. This innovative approach promises to more accurately capture the essence of textual data, empowering companies to formulate superior strategies and make informed decisions.

2.3 Scope and contribution

This study concentrates on the extraction and clustering of topics from textual data derived from numerous companies' news data sources.

However, its scope is confined to outlining the methodology for collecting news data from individual firms, extracting topic proportions, and clustering based on these proportions. We explicitly state the study's limitations concerning the specific topics under investigation to bolster the research's credibility. For instance, we may refrain from delving deeply into a particular topic and clarify the constraints on the generalizability of our findings.

The proposed methodology in this study holds the potential to facilitate the effective handling and utilization of this vast text data reservoir. Furthermore, if this methodology is applied to Korean exporters, it could play a pivotal role in transforming existing export support services and mitigating the recent trade deficit.

3. Literature review

3.1 Non-graph-based method

3.1.1 Latent Dirichlet Allocation (LDA)

LDA, a classic topic modeling technique, uncovers hidden topics within a corpus by probabilistically assigning the words in each document to topics[2]. Each document is viewed as a mixture of topics, and each topic is characterized by a probability distribution over words.

\[p(d|\alpha,\beta^v_{z_n}) = \int{p(\theta_d|\alpha)} \prod_{n} \sum_{z_n} p(w_{d,n}|z_n,\beta^v_{z_n})p(z_n|\theta_d)d\theta_d \]

where $\beta$ is the $k\times V$ topic-word matrix and $p(w_{d,n}|z_n,\beta^v_{z_n})$ is the probability that word $w_{d,n}$ occurs when the topic is $z_n$.
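To make the structure of this likelihood concrete, the following sketch evaluates the inner part of the integral for a fixed $\theta_d$ and then approximates the integral over $\theta_d$ by simple Monte Carlo sampling from the Dirichlet prior. The function names, the toy $\beta$ matrix, and the sampling-based approximation are assumptions for illustration; practical LDA implementations use variational inference or Gibbs sampling instead.

```python
import numpy as np

def doc_log_likelihood(word_ids, theta, beta):
    """log p(w_d | theta_d, beta) = sum_n log sum_z theta_z * beta[z, w_n]."""
    word_probs = theta @ beta                    # shape (V,): p(w | theta, beta)
    return float(np.sum(np.log(word_probs[word_ids])))

def marginal_log_likelihood(word_ids, alpha, beta, n_samples=2000, seed=0):
    """Monte Carlo estimate of log p(d | alpha, beta), averaging over theta ~ Dir(alpha)."""
    rng = np.random.default_rng(seed)
    thetas = rng.dirichlet(alpha, size=n_samples)             # (S, K) prior samples
    log_liks = np.log(thetas @ beta)[:, word_ids].sum(axis=1)  # (S,) per-sample log-likelihood
    m = log_liks.max()                                         # log-mean-exp for stability
    return float(m + np.log(np.mean(np.exp(log_liks - m))))

# Toy example (assumed values): K = 2 topics, V = 3 vocabulary words.
beta = np.array([[0.7, 0.2, 0.1],
                 [0.1, 0.2, 0.7]])
theta = np.array([0.5, 0.5])
word_ids = np.array([0, 2, 2])                   # a three-word document
ll_fixed = doc_log_likelihood(word_ids, theta, beta)
ll_marginal = marginal_log_likelihood(word_ids, np.array([1.0, 1.0]), beta)
```

With an even topic mixture, each of the three words has probability 0.4, so `ll_fixed` equals three times the log of 0.4.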

However, LDA has a limitation known as the "independence" problem. It treats words as independent and doesn't consider their order or relationships within documents. This simplification can hinder LDA's ability to capture contextual dependencies between words. To address this, models like Word2Vec and GloVe have been developed, taking word order and dependencies into account to provide more nuanced representations of textual data.

3.1.2 Latent Semantic Analysis (LSA)

LSA is a method to uncover the underlying semantic structure in textual data. It achieves this by assessing the semantic similarity between words using document-word matrices[4]. LSA's fundamental concept involves recognizing semantic connections among words based on their distribution within a document. To accomplish this, LSA relies on linear algebra techniques, particularly Singular Value Decomposition (SVD), to condense the document-word matrix into a lower-dimensional representation. This process allows semantically related words or documents to be situated in proximity within this reduced space.

\[X=U\Sigma V^T\]

\[Sim(Q,X)=R=Q^T X\]

where $X$ is a $t \times d$ matrix representing a collection of $d$ documents in a space of $t$ dictionary terms, and $Q$ is a $t \times q$ matrix representing a collection of $q$ documents in the same space of $t$ dictionary terms.

$U$ contains the term eigenvectors and $V$ contains the document eigenvectors.

LSA, an early form of topic modeling, excels at identifying semantic similarities among words. Nonetheless, it has its limitations, particularly in its inability to fully capture contextual information and word relationships.
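The two LSA equations above can be sketched with a small term-document matrix. The matrix values, the choice of $k = 2$ retained dimensions, and the function names are hypothetical; the sketch only shows the truncated SVD and the similarity product $Q^T X$.

```python
import numpy as np

def lsa_fit(X, k):
    """Truncated SVD of the term-document matrix: X ~ U_k diag(s_k) V_k^T."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :]

def rank_k_approx(U_k, s_k, Vt_k):
    """Rebuild the rank-k approximation of X from its truncated factors."""
    return U_k @ np.diag(s_k) @ Vt_k

def similarity(Q, X):
    """R = Q^T X: one row per query document, one column per corpus document."""
    return Q.T @ X

# Toy data (assumed): t = 4 terms, d = 3 documents, q = 1 query.
X = np.array([[2.0, 0.0, 1.0],
              [1.0, 0.0, 0.0],
              [0.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])
Q = np.array([[1.0], [1.0], [0.0], [0.0]])       # query uses terms 0 and 1

U_k, s_k, Vt_k = lsa_fit(X, k=2)
X_k = rank_k_approx(U_k, s_k, Vt_k)
R_raw = similarity(Q, X)                          # similarity in the original space
R_lsa = similarity(Q, X_k)                        # similarity after rank-2 smoothing
```

Comparing `R_raw` with `R_lsa` shows how the low-rank projection can assign nonzero similarity to documents that share no literal terms with the query.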

3.1.3 Neural Topic Model (NTM)

Traditional topic modeling has limitations, including sensitivity to initialization and challenges related to unigram topic distribution. The Neural Topic Model (NTM) bridges topic modeling and deep learning, aiming to enhance word and document representations to overcome these issues.

At its core, NTM seamlessly combines word and document representations by embedding topic modeling within a neural network framework. While preserving the probabilistic nature of topic modeling, NTMs represent words and documents as vectors, leveraging them as inputs for neural networks. This involves mapping words and documents into a shared latent space, accomplished through separate neural networks for word and document vectors, ultimately leading to the computation of the topic distribution.

The computational process of NTM includes training using back-propagation and inferring topic distribution through Bayesian methods and Gibbs sampling.

\[p(w|d) = \sum^K_{i=1} p(w|t_i)p(t_i|d)\]

where $t_i$ is a latent topic and $K$ is the pre-defined topic number. Let $\phi(w) = [p(w|t_1), \dots , p(w|t_K)]$ and $\theta(d) = [p(t_1|d), \dots, p(t_K|d)]$, where $\phi$ is shared among the corpus and $\theta$ is document-specific.

Then the above equation can be written in vector form:

\[p(w|d) = \phi(w) \times \theta^T(d) \]
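As a small numerical check, the sum over topics and the vector form give the same value. The probabilities below are made-up numbers for a single word and a single document with $K = 3$ topics.

```python
import numpy as np

# Hypothetical values: p(w | t_i) for one word and p(t_i | d) for one document.
phi_w = np.array([0.10, 0.02, 0.30])
theta_d = np.array([0.5, 0.3, 0.2])

# Explicit sum over the K topics.
p_sum = sum(phi_w[i] * theta_d[i] for i in range(len(theta_d)))

# Vector form: inner product of phi(w) and theta(d).
p_vec = float(phi_w @ theta_d)
```

Both expressions evaluate to 0.05 + 0.006 + 0.06 = 0.116.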

3.2 Graph-based methods

3.2.1 Global random topic field

To capture word dependencies within a document, the graph structure incorporates topic assignment relationships among words to enhance accuracy[9].

GloVe-derived word vectors are mapped to Euclidean space, while the document's internal graph structure, identified as the Word Graph, operates in a non-Euclidean domain. This enables the Word Graph to uncover concealed relationships that traditional Euclidean numerical data representation cannot reveal.

Calculating the "structure representing word relationships" involves employing a Global Random Field (GRF) that encodes the graph structure in the document using topic weights of words and the topic connections in the graph's edges. The GRF formula is as follows:

\[ p(G) = f_G (g) = \frac{1}{|E|} \phi(z_W) \sum_{(w', w'') \in E} \phi(z_{w'}, z_{w''}) \]

The Global Topic-Word Random Field (GTRF) shares similarities with the GRF. In the GTRF, the topic distribution ($z$) becomes a conditional distribution on $\theta$. Learning and inferring in this model closely resemble the EM algorithm. The outcome, denoted as $p_{GTRF}(z|\theta)$, represents the probability of the graph structure considering whether neighboring words ($w'$ and $w''$) are assigned to the same topic or different topics. This is expressed as:

\[ p_{GTRF}(z|\theta) = \frac{1}{|E|} Multi(z_W|\theta) \times \sum_{(w', w'') \in E} (\sigma_{z_{w'} = z_{w''}}\lambda_1 + \sigma_{z_{w'} \neq z_{w''}}\lambda_2) \]

where $\sigma_{x}$ is an indicator function that returns 1 if the condition $x$ is true and 0 if $x$ is false.
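The GTRF expression can be evaluated directly on a tiny word graph. The sketch below assumes that $Multi(z_W|\theta)$ is the product of $\theta_{z_w}$ over the words and takes $\lambda_1$, $\lambda_2$ as given weights; it evaluates the expression as written and does not check any normalization constraint on the weights. All values and the function name are hypothetical.

```python
import numpy as np

def gtrf_prob(z, theta, edges, lam1, lam2):
    """Evaluate (1/|E|) * Multi(z | theta) * sum over edges of the same/different-topic weights."""
    z = np.asarray(z)
    multi = float(np.prod(theta[z]))             # product of theta_{z_w} over all words
    edge_sum = sum(lam1 if z[a] == z[b] else lam2 for a, b in edges)
    return multi * edge_sum / len(edges)

# Toy document (assumed): three words, two topics, a chain graph 0 - 1 - 2.
theta = np.array([0.6, 0.4])
z = [0, 0, 1]                                     # topic assignment of each word
edges = [(0, 1), (1, 2)]
p = gtrf_prob(z, theta, edges, lam1=1.2, lam2=0.8)
```

Here edge (0, 1) joins two words with the same topic and contributes $\lambda_1$, while edge (1, 2) joins different topics and contributes $\lambda_2$, so a larger $\lambda_1$ rewards neighboring words that share a topic.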

3.2.2 GraphBTM

While LDA encounters challenges related to data sparsity, particularly when modeling short texts, the Biterm Topic Model (BTM) faces limitations in its expressiveness, especially when dealing with documents containing diverse topics[13]. Additionally, BTM relies on biterms (unordered word pairs) in conjunction with the co-occurrence features of words, which restricts its suitability for modeling longer texts.

To address these limitations, the Graph-Based Biterm Topic Model (GraphBTM) was developed. GraphBTM introduces a graphical representation of biterms and employs Graph Convolutional Networks (GCN) to extract transitive features, effectively overcoming the shortcomings associated with traditional models like LDA and BTM.

GraphBTM's computational approach relies on Amortized Variational Inference. This method involves sampling a mini-corpus to create training instances, which are subsequently used to construct graphs and apply GCN. The inference network then estimates the topic distribution, which is vital for training the model. Notably, this approach has demonstrated the capability to achieve higher topic consistency scores compared to traditional Auto-Encoding Variational Bayes (AEVB)-based inference methods.

3.2.3 Graphical Neural Topic Model (GNTM)

LDA, in its conventional form, makes an assumption of independence. It posits that each document is generated as a blend of topics, with each topic representing a distribution over the words within the document. However, this assumption of conditional independence, also known as exchangeability, overlooks the intricate relationships and context that exist among words in a document.

The Neural Variational Inference (NVI) algorithm presents a departure from this independence assumption. NVI is a powerful technique for estimating the posterior distribution of latent topics in text data. It leverages a neural network structure, employing a reparameterization trick to approximate the true posterior distribution for a wide array of distributions.

\[\alpha(prior) \rightarrow z(topic) \: from \: \theta \rightarrow G_d(structure) \rightarrow V(word \: set) \]

\[p(G^0_d|Z_d;M) = \prod_{(n,n') \in E^0_d} m_{z_{d,n} z_{d,n'}} \prod_{(n,n') \notin E^0_d} (1-m_{z_{d,n} z_{d,n'}})\]

\[p(G_d, \theta_d, Z_d;\alpha) = p(V_d|Z_d,G^0_d)p(G^0_d|Z_d)\prod^{N_d}_{n=1} p(z_{d,n}|\theta_d)p(\theta_d|\alpha) \]
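The edge term $p(G^0_d|Z_d;M)$ can be read as a Bernoulli model over word pairs, with edge probability given by the entry of $M$ for the two words' topics. The sketch below computes its logarithm for a toy document. Treating pairs as unordered, and the values of `M`, `z`, and `edges`, are assumptions made for illustration.

```python
import numpy as np
from itertools import combinations

def log_p_graph(z, edges, M):
    """log p(G^0_d | Z_d; M) with edge probability M[z_n, z_n'] for each unordered word pair."""
    edge_set = {frozenset(e) for e in edges}
    total = 0.0
    for n, n2 in combinations(range(len(z)), 2):
        m = M[z[n], z[n2]]
        if frozenset((n, n2)) in edge_set:
            total += np.log(m)                   # pair is connected in the graph
        else:
            total += np.log1p(-m)                # pair is not connected
    return float(total)

# Toy document (assumed): three words, two topics, one observed edge.
M = np.array([[0.8, 0.1],
              [0.1, 0.6]])                       # topic-to-topic edge probabilities
z = [0, 0, 1]
edges = [(0, 1)]
log_p = log_p_graph(z, edges, M)
```

The connected same-topic pair contributes $\log 0.8$ and each of the two unconnected cross-topic pairs contributes $\log 0.9$, giving $\log 0.648$ in total.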

Unlike the Variational Autoencoder (VAE), which is primarily employed for denoising and data restoration and can be likened to an 'encoder + decoder' architecture, NVI serves a broader purpose and can handle a more extensive range of distributions. It's based on the mean-field assumption and employs the Laplace approximation method, replacing challenging distributions like the Dirichlet distribution with the computationally efficient logistic normal distribution[8].

Based on the mean-field assumption:

\[q(\theta_d,Z_d|G_d) = q(\theta_d|G_d;\mu_d, \delta_d) \prod^{N_d}_{n=1} q(z_{d,n}|G_d,w_{d,n};\varphi_{d,n})\]

\[L_d = E_{q(Z_d|G_d)} [\log p(G^0_d|Z_d;M) + \log p(V_d|Z_d, G^0_d;\beta)] - KL[q(\theta_d|G_d)||p(\theta_d)] - E_{q(\theta_d|G_d)}\sum^{N_d}_{n=1} KL[q(z_{d,n}|G_d, w_{d,n})||p(z_{d,n}|\theta_d)]
\]

This substitution simplifies parameter estimation, making it more tractable and readily differentiable. In the context of the Graphical Neural Topic Model (GNTM), the logistic normal distribution facilitates the approximation of correlations between latent variables, allowing for the utilization of dependencies between topics. Additionally, the Evidence Lower Bound (ELBO) in NVI is differentiable in closed-form, enhancing its applicability.

The concept of topic proportion is represented by the equation:

\[\theta_d = \text{softmax}(N(\mu_d, \delta_d^2))\]

\[f_X(x;\mu,\sigma) = \frac{1}{\sigma \sqrt{2\pi}}e^{-\frac{(\text{logit}(x)-\mu)^2}{2\sigma^2}}\frac{1}{x(1-x)}\]

This equation encapsulates the distribution of topics within a document, reflecting the proportions of different topics in that document.
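A minimal sketch of both expressions follows: topic proportions are drawn by passing a reparameterized Gaussian sample through a softmax, and the one-dimensional logit-normal density is evaluated in its standard form, with a negative exponent. The function names and parameter values are assumptions for illustration and are not taken from the GNTM implementation.

```python
import numpy as np

def sample_topic_proportions(mu, delta, rng):
    """theta = softmax(mu + delta * eps), eps ~ N(0, I) (reparameterization trick)."""
    eps = rng.standard_normal(mu.shape)
    x = mu + delta * eps
    x = x - x.max()                              # shift for numerical stability
    e = np.exp(x)
    return e / e.sum()

def logit_normal_pdf(x, mu, sigma):
    """Density of X on (0, 1) when logit(X) ~ N(mu, sigma^2)."""
    logit = np.log(x / (1.0 - x))
    gauss = np.exp(-(logit - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))
    return gauss / (x * (1.0 - x))               # Jacobian of the logit transform

rng = np.random.default_rng(0)
theta = sample_topic_proportions(np.zeros(4), np.ones(4), rng)   # K = 4 topics (assumed)

# Numerical check that the density integrates to about 1 on (0, 1).
grid = np.linspace(1e-6, 1 - 1e-6, 200001)
density = logit_normal_pdf(grid, mu=0.0, sigma=1.0)
area = float(np.sum((density[1:] + density[:-1]) / 2 * np.diff(grid)))
```

Because the sample is a deterministic function of $\mu_d$, $\delta_d$ and the noise `eps`, gradients can flow through `theta` to the variational parameters, which is what makes this parameterization convenient for neural inference.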

Figure 2. Transformation of logit-normal distribution after conversion

3.3 Visualization techniques

3.3.1 Fast unfolding of communities in large networks

This algorithm aids in detecting communities within topic-words networks, facilitating interpretation and understanding of topic structures.

3.3.2 Uniform Manifold Approximation and Projection (UMAP)

UMAP is a nonlinear dimensionality reduction technique that preserves the underlying structure and patterns of high-dimensional data while efficiently visualizing it in lower dimensions. It is often reported to preserve global data structure better than traditional methods such as t-SNE.

3.3.3 Agglomerative Hierarchical Clustering

Hierarchical clustering is an algorithm that clusters data points, combining them based on their proximity until a single cluster remains. It provides a dynamic and adaptive way to maintain cluster structures, even when new data is added.

Additionally, several evaluation metrics, including the Silhouette score, Calinski-Harabasz index, and Davies-Bouldin index, assist in selecting the optimal number of clusters for improved data understanding and analysis.
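The interplay between agglomerative clustering and the Silhouette score can be sketched in plain NumPy as below. Average linkage, Euclidean distance, and the synthetic three-blob data are choices made for this illustration only; the text does not specify the linkage used, and a library implementation would normally be preferred for real data.

```python
import numpy as np

def agglomerative(X, n_clusters):
    """Average-linkage agglomerative clustering; returns an integer label per point."""
    D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
    clusters = [[i] for i in range(len(X))]      # start with one cluster per point
    while len(clusters) > n_clusters:
        best = (np.inf, 0, 1)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = D[np.ix_(clusters[a], clusters[b])].mean()
                if d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]  # merge the closest pair
        del clusters[b]
    labels = np.empty(len(X), dtype=int)
    for k, members in enumerate(clusters):
        labels[members] = k
    return labels

def silhouette(X, labels):
    """Mean silhouette coefficient (requires at least two clusters)."""
    D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
    ids = set(labels.tolist())
    scores = []
    for i in range(len(X)):
        same = labels == labels[i]
        if same.sum() == 1:
            scores.append(0.0)                   # convention for singleton clusters
            continue
        a = D[i, same].sum() / (same.sum() - 1)  # mean distance within own cluster
        b = min(D[i, labels == k].mean() for k in ids if k != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

# Synthetic data (assumed): three well-separated blobs of 10 points each.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([c + 0.5 * rng.standard_normal((10, 2)) for c in centers])

scores = {k: silhouette(X, agglomerative(X, k)) for k in range(2, 6)}
best_k = max(scores, key=scores.get)             # number of clusters with the highest score
```

The same loop over candidate cluster counts could use the Calinski-Harabasz or Davies-Bouldin index in place of the Silhouette score.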

4. Method

4.1 Graphical Neural Topic Model (GNTM) as Factor analysis

GNTM can be viewed from a factor analysis perspective, as it employs concepts similar to factor analysis to unveil intricate interrelationships in data and extract topics. GNTM can extract $\theta$, which signifies the proportion of topics in each document, for summarizing and interpreting document content. In this case, $\theta$ follows a logistic normal distribution, enabling the probabilistic modeling of topic proportions.

The $\theta$ can be represented as follows[1][7]:

\[ \tilde{\theta} \sim \text{LN}(\mu, \sigma^2) \]

For $0 < x_i < 1$ and $\sum_{i=1}^K x_i = 1$:

\[ y = [\log(\frac{x_1}{x_K}), ..., \log(\frac{x_{K-1}}{x_K})]^T \]

Probability Density Function (PDF) for $X$:

\[ f_X(x; \mu, \Sigma) = \frac{1}{|2 \pi \Sigma|^{\frac{1}{2}}} \frac{1}{\prod^K_{i=1} x_i (1-x_i)} e^{-\frac{1}{2} \{ \log (\frac{x}{1-x}) - \mu \} \Sigma^{-1} \{ \log(\frac{x}{1-x}) - \mu \}} \]

where the log and division in the argument are element-wise. This is due to the diagonal Jacobian matrix of the transformation with elements $\frac{1}{{x_i}{(1-x_i)}}$
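To make the log-ratio transform concrete, here is a small sketch of the mapping between topic proportions and unconstrained reals, plus a sampler that draws proportions by pushing Gaussian noise through the inverse map. The helper names and the diagonal-covariance assumption (independent components with standard deviations `sigma`) are simplifications introduced here for illustration; the model itself allows a full $\Sigma$.

```python
import math
import random


def alr(x):
    """Additive log-ratio: map K proportions to K-1 unconstrained reals."""
    ref = x[-1]  # the last component is the reference
    return [math.log(xi / ref) for xi in x[:-1]]


def alr_inverse(y):
    """Inverse map: softmax over (y_1, ..., y_{K-1}, 0)."""
    exps = [math.exp(v) for v in y] + [1.0]
    total = sum(exps)
    return [e / total for e in exps]


def sample_theta(mu, sigma, rng):
    """Draw y ~ N(mu, diag(sigma^2)) and map it onto the simplex.

    The diagonal covariance is an assumption made for this sketch.
    """
    y = [rng.gauss(m, s) for m, s in zip(mu, sigma)]
    return alr_inverse(y)


theta = [0.2, 0.3, 0.5]
roundtrip = alr_inverse(alr(theta))       # recovers theta up to rounding
sampled = sample_theta([0.0, 0.0], [1.0, 1.0], random.Random(0))
print(roundtrip, sampled)
```

Every sampled vector is strictly positive and sums to one, which is why this construction is convenient for modeling topic proportions.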

GNTM shares similarities with factor analysis, which dissects complex data into factors associated with each topic to unveil the data's structure. In factor analysis, the aim is to explain observed data using latent factors. Similarly, GNTM treats topics in each document as latent variables, and these topics contribute to shaping the word distribution in the document. Consequently, GNTM decomposes documents into combinations of words and topics, offering an interpretable method for understanding document similarities and differences.

4.2 Akaike Information Criteria (AIC)

The Akaike Information Criterion (AIC) is a crucial statistical technique for model selection and comparison, evaluating the balance between a model's goodness of fit and its complexity. AIC aids in selecting the most appropriate model from a set of models.

In the context of this thesis, AIC is employed to assess the fit of a Graphical Neural Topic Model (GNTM) and determine the optimal model. Since GNTMs involve parameters related to the number of topics in topic modeling, selecting the appropriate number of topics is a significant consideration. AIC assesses various GNTM models based on the choice of the number of topics and assists in identifying the most suitable number of topics.

AIC can be represented by the following formula:

\[ AIC = -2 \cdot \text{log-likelihood} + 2 \cdot \text{number of parameters} \]

Where:

The $\text{log-likelihood}$ is a measure of the goodness of fit of the model to explain the data.
Number of parameters indicates the count of parameters in the model.

AIC weighs the tradeoff between a model's log-likelihood and the number of parameters, which reflects the model's complexity. Lower AIC values indicate better data fit while favoring simpler models. Therefore, the model with the lowest AIC is considered the best. AIC plays a pivotal role in enhancing the quality of topic modeling in GNTM by assisting in managing model complexity when choosing the number of topics.
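As a quick illustration of the selection rule, the sketch below computes AIC for a few candidate models and picks the lowest value. The log-likelihoods and parameter counts are made-up numbers for demonstration only and do not come from the experiments in this paper.

```python
def aic(log_likelihood: float, n_params: int) -> float:
    """AIC = -2 * log-likelihood + 2 * number of parameters."""
    return -2.0 * log_likelihood + 2.0 * n_params


# Hypothetical candidates: name -> (log-likelihood, number of parameters).
candidates = {
    "K=10": (-5230.0, 10),
    "K=20": (-5105.0, 20),
    "K=30": (-5101.0, 30),
}

scores = {name: aic(ll, k) for name, (ll, k) in candidates.items()}
best = min(scores, key=scores.get)  # lowest AIC wins
print(scores, best)
```

Going from K=20 to K=30 improves the log-likelihood slightly in this toy example, but not enough to offset the penalty for the extra parameters, so the simpler model is preferred.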

For our current model, in which $\theta$ follows a logistic normal distribution, we use GNTM's likelihood over the $D$ documents:

\[ L(\theta \mid D) = \prod_{d=1}^{D} \frac{1}{|2 \pi \Sigma|^{\frac{1}{2}}} \frac{1}{\prod_{k=1}^{K} \theta_k (1-\theta_k)} \exp\left( -\frac{1}{2} \left\{ \log \left(\frac{\theta}{1-\theta}\right) - \mu \right\}^T \Sigma^{-1} \left\{ \log \left(\frac{\theta}{1 - \theta}\right) - \mu \right\} \right) \]

Substituting into the AIC formula, with the number of topics taken as the number of parameters, gives:

\[ AIC = -2 \cdot l(\theta) + 2 \cdot \text{number of topics} \]

where $l(\theta) = \log L(\theta \mid D)$ is the log-likelihood:

\[ l(\theta) = \sum_{d=1}^D \left[ -\frac{1}{2}\log (|2\pi \Sigma|) - \sum_{k=1}^K \log(\theta_k (1 - \theta_k)) - \frac{1}{2} \left\{\log\left(\frac{\theta}{1-\theta}\right) - \mu\right\}^T \Sigma^{-1} \left\{\log\left(\frac{\theta}{1-\theta}\right) - \mu\right\} \right] \]

Here $\theta$ inside the product and the sum denotes the topic-proportion vector of document $d$.

This encapsulates the essence of GNTM and AIC in evaluating and selecting models.
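The log-likelihood and the resulting AIC can be sketched as follows. To keep the example dependency-free, it assumes a diagonal $\Sigma$ (passed as a list of variances `var`), so that $\log|2\pi\Sigma|$ reduces to a sum over components; this simplification, the function names, and the toy inputs are assumptions made here, not the implementation used in the study.

```python
import math


def doc_log_density(theta, mu, var):
    """Log-density of one document's topic proportions under a diagonal Sigma."""
    log_det = sum(math.log(2.0 * math.pi * v) for v in var)   # log|2*pi*Sigma|
    jacobian = sum(math.log(t * (1.0 - t)) for t in theta)    # sum_k log(theta_k (1 - theta_k))
    quad = sum((math.log(t / (1.0 - t)) - m) ** 2 / v         # quadratic form in logit(theta) - mu
               for t, m, v in zip(theta, mu, var))
    return -0.5 * log_det - jacobian - 0.5 * quad


def log_likelihood(thetas, mu, var):
    """Sum of per-document log-densities, l(theta)."""
    return sum(doc_log_density(theta, mu, var) for theta in thetas)


def gntm_aic(thetas, mu, var):
    """AIC with the number of topics used as the parameter count."""
    n_topics = len(mu)
    return -2.0 * log_likelihood(thetas, mu, var) + 2.0 * n_topics


# Toy inputs (hypothetical): two documents, three topics.
thetas = [[0.2, 0.3, 0.5], [0.6, 0.1, 0.3]]
mu = [0.0, 0.0, 0.0]
var = [1.0, 1.0, 1.0]
print(log_likelihood(thetas, mu, var), gntm_aic(thetas, mu, var))
```

Repeating this for each candidate number of topics and keeping the model with the lowest value reproduces the selection procedure described above.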

5. Result

5.1 Model setup

5.1.1 Data

The data consists of news related to the top 200 companies by market capitalization on the NASDAQ stock exchange. These news articles were collected by crawling Newsdata.io in August. Analyzing this data can provide insights into the trends and company events of that month, and restricting the data to a specific timeframe helps in interpreting the analysis results clearly.

To clarify the research objectives, companies with fewer than 10 articles collected were excluded from the analysis. Additionally, a maximum of 100 articles per company was considered. As a result, a total of 13,896 documents were collected, and after excluding irrelevant documents, 13,816 were used for the analysis. The data format is consistent with the "20 News Groups" dataset, and data preprocessing methods similar to those in Shen et al. (2021) [10] were applied. This includes removing stopwords, abbreviations, and punctuation, followed by tokenization and vectorization. You can find examples of the data in the Appendix.
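The filtering and preprocessing steps described above might look roughly like the sketch below. It is a simplified illustration: the tiny stopword list, the regular-expression tokenizer, and the bag-of-words vectorizer are stand-ins chosen here (abbreviation handling is omitted), not the pipeline of Shen et al. or the one used in this study.

```python
import re
from collections import Counter, defaultdict

# Tiny illustrative stopword list; a real pipeline would use a fuller one.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "on", "for", "is", "it"}


def select_articles(articles, min_per_company=10, max_per_company=100):
    """Drop companies with too few articles and cap the rest."""
    by_company = defaultdict(list)
    for company, text in articles:
        by_company[company].append(text)
    return {company: texts[:max_per_company]
            for company, texts in by_company.items()
            if len(texts) >= min_per_company}


def tokenize(text):
    """Lowercase, strip punctuation and digits, and remove stopwords."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]


def vectorize(docs):
    """Bag-of-words counts over a sorted vocabulary."""
    vocab = sorted({w for doc in docs for w in tokenize(doc)})
    rows = []
    for doc in docs:
        counts = Counter(tokenize(doc))
        rows.append([counts.get(w, 0) for w in vocab])
    return vocab, rows


vocab, rows = vectorize(["AI cloud AI", "cloud"])
print(vocab, rows)
```

The thresholds of 10 and 100 articles per company mirror the values stated in the text; everything else is a placeholder.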

5.1.2 Parameters

In our experiments, as the dataset contained a large number of words and edges, it was necessary to reduce the number of parameters for training while minimizing noise and capturing important information. To achieve this, we set the threshold for the number of words and edges to 140 and 40, respectively, which is consistent with the configuration used in the BNC dataset, a similar dataset. The experiments were conducted in an RTX3060 GPU environment using the CUDA 11.8 framework, with a batch size of 25. To determine the optimal number of topics, we calculated and compared AIC values for different numbers of topics. Based on the comparison of AIC values, we selected 20 as the final number of topics.

5.2 Evaluation

5.2.1 AIC

Figure 3. Changes in AIC values depending on the number of topics

AIC is used in topic modeling as a tool to select the optimal number of topics. However, AIC is a relative number and may vary for different data or models. Therefore, when using AIC to determine the optimal number of topics, it is important to consider how this metric applies to your data and model.

In our study, we calculated the AIC for a given dataset and model architecture and used it to select the optimal number of topics. This approach served as an important metric for finding the best number of topics for our data. The AIC was used to evaluate the goodness of fit of our model, allowing us to compare the performance of the model for different numbers of topics.

Because AIC is a relative measure, its values are meaningful only when comparing candidate models fitted to the same data. We therefore use it for hyperparameter tuning on our own data and model, here the choice of the number of topics, rather than for comparison against other models or datasets. This approach is one of the key strengths of our work, contributing to a greater emphasis on the effective utilization and interpretation of topic models.

5.2.2 Topic interpretation

5.2.3 Classification

Figure 4b. 30 Topics graph: The result of Agglomerative Clustering

In our study, we leveraged Agglomerative Clustering and UMAP to classify and visualize news data. In our experiments, we found that news is generally better classified when the number of topics is 10. These results suggest that the model is able to group and interpret the given data more effectively.

However, when the number of topics is increased, broader topics tend to be categorized into more detailed topics. This results in news content being broken down into relatively more detailed topics, but the main themes may not be more apparent.

Figure 5c. UMAP graph with 30 topics: The result of Agglomerative Clustering

Also, as the number of topics increases, the difference in the proportion of topics that represent the nature of the news increases. This indicates a hierarchy between major and minor topics, which can be useful when you want to fine-tune your investigation of different aspects of the news. This diversity provides important information for detailed topic analysis in context.

Therefore, when choosing the number of topics, we need to consider the balance between major and minor topics. By choosing the right number of topics, the model can best understand and interpret the given data, and we can tailor the results of the topic analysis to reflect the key features of the news content.

6. Discussion

6.1 Limitation

Even though this paper has contributed to addressing various challenges related to textual data analysis, it is essential to acknowledge some inherent limitations in the proposed methodology:

Noise Edges Issue
The modeling approach used in this paper introduces a challenge related to noise edges in the data, which can be expected when dealing with extensive corpora or numerous documents from various sources.
To effectively mitigate this noise issue, it is crucial to implement regularization techniques tailored to the specific objectives and nature of the data. Approaches such as the one proposed by Zhu et al. (2023)[12] have enhanced model performance by more efficiently discovering hidden topic distributions within documents.
Textual Data Versatility
While this paper focuses on extracting and utilizing the topic latent space from text data, it is worth noting that textual data analysis can have diverse applications across various fields.
In addition to hierarchical clustering, there is potential to explore alternative recommendation models, such as Matrix Factorization methods like NGCF (Neural Graph Collaborative Filtering)[11] and LightGCN (Light Graph Convolutional Network)[6], which utilize techniques like Graph Neural Networks (GNN) for enhancing recommendation performance.

Acknowledging these limitations is essential for a comprehensive understanding of the proposed methodology's scope and areas for potential future research and improvement.

6.2 Future work

While this study has made significant strides in addressing key challenges in the analysis of textual data and extracting valuable insights through topic modeling, there remain several avenues for future research and improvement:

Enhanced Noise Handling
The modeling used has shown promise but is not immune to noise edge issues often encountered in extensive datasets. In this study, we used a dataset comprising approximately 9,000 news articles from 194 countries, totaling around 5 million words. To mitigate these noise edge issues effectively, future work can focus on developing advanced noise reduction techniques or data preprocessing methods tailored to specific domains, further enhancing the quality of extracted topics and insights.
Cross-Domain Application
While the study showcased its effectiveness in the context of news articles, extending this approach to other domains presents an exciting opportunity. Adapting the model to different domains may require domain-specific preprocessing and feature engineering, as well as considering transfer learning approaches. Models based on Graph Neural Networks (GNN) and Matrix Factorization, such as Neural Graph Collaborative Filtering (NGCF) and LightGCN, can be employed to enhance recommendation systems and knowledge discovery in diverse fields. This cross-domain versatility can unlock new possibilities for leveraging textual data to extract meaningful insights and improve decision-making processes across various industries and research domains.

7. Conclusion

In this discussion, the term "group information" refers to the topic proportions represented by $\theta$. This work can be characterized as non-linear factor analysis (FA) applied to textual data, analogous to traditional FA methods applied to numerical data. Extracting the factors is non-trivial, which is why we classify the approach as non-linear FA. (Indeed, there is inter-topic covariance.)

So far, the process has extracted information from textual data that would otherwise be difficult to use: the structural attributes of words and topics, the proportions of topics, and insights into the prior distribution governing topic proportions. These elements make it possible to characterize the information within each group quantitatively.

A central challenge in conventional Principal Component Analysis (PCA) and FA techniques is the absence of definitive answers, so interpreting the extracted factors is difficult and uncertain. The GNTM methodology applied in this paper to textual data, however, provides a network of words for each factor, allowing quick interpretation.

If certain words are prominent within Topic 1, they provide a basis for interpreting it. This aligns with the intent of GNTM. In effect, the model makes it possible to observe the pivotal terms within each topic (factor) and helps explain the concepts they represent.

This research has presented a comprehensive methodology for the analysis of textual data using Graphical Neural Topic Models (GNTM). The paper discussed how GNTM leverages the advantages of both topic modeling and graph-based techniques to uncover hidden patterns and structures within large text corpora. The experiments conducted demonstrated the effectiveness of GNTM in extracting meaningful topics and providing valuable insights from a dataset comprising news articles.

In conclusion, this research contributes to advancing the field of textual data analysis by providing a powerful framework for extracting interpretable topics and insights. The combination of GNTM and future enhancements is expected to continue facilitating knowledge discovery and decision-making processes across various domains.

Nevertheless, a pertinent concern is the large amount of noise that pervades newspaper data, and indeed most data. Traditional methodologies mitigate noise with techniques such as Neural Variational Inference (NVI) and by running numerous epochs to extract salient tokens. In this research, as mentioned, there were no time constraints, so as many epochs as necessary could be run.

However, computational efficiency was improved by reducing the number of topics, while retaining the primary objectives from a clustering perspective by finding the optimal number of topics with AIC and agglomerative clustering. This revealed that when the number of topics was reduced, words associated with the original topics appeared within sub-networks of the reduced topics.

Future research can further enhance the capabilities of GNTM by improving noise handling techniques and exploring cross-domain applications.

References

[1] Aitchison, J., and Shen, S. M. Logistic-normal distributions: Some properties and uses. Biometrika 67, 2 (1980), 261–272.

[2] Blei, D. M., Ng, A. Y., and Jordan, M. I. Latent Dirichlet allocation. Journal of Machine Learning Research 3, Jan (2003), 993–1022.

[3] Choi, M. J., and Kim, K. K. Import demand in developed economies. In Economic Analysis (Quarterly) (2019), vol. 25, Economic Research Institute, Bank of Korea, pp. 34–65.

[4] Evangelopoulos, N. E. Latent semantic analysis. Wiley Interdisciplinary Reviews: Cognitive Science 4, 6 (2013), 683–692.

[5] Han, K. J. Analysis and implications of overseas market provision system based on domestic logistics big data. KISDI AI Outlook 2022, 8 (2022), 17–30.

[6] He, X., Deng, K., Wang, X., Li, Y., Zhang, Y., and Wang, M. LightGCN: Simplifying and powering graph convolution network for recommendation. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval (2020), pp. 639–648.

[7] Hinde, J. Logistic Normal Distribution. Springer Berlin Heidelberg, Berlin, Heidelberg, 2011, pp. 754–755.

[8] Kingma, D. P., and Welling, M. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114 (2013).

[9] Li, Z., Wen, S., Li, J., Zhang, P., and Tang, J. On modelling non-linear topical dependencies. In Proceedings of the 31st International Conference on Machine Learning (Beijing, China, 22–24 Jun 2014), E. P. Xing and T. Jebara, Eds., vol. 32 of Proceedings of Machine Learning Research, PMLR, pp. 458–466.

[10] Shen, D., Qin, C., Wang, C., Dong, Z., Zhu, H., and Xiong, H. Topic modeling revisited: A document graph-based neural network perspective. Advances in Neural Information Processing Systems 34 (2021), 14681–14693.

[11] Wang, X., He, X., Wang, M., Feng, F., and Chua, T.-S. Neural graph collaborative filtering. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval (Jul 2019), ACM.

[12] Zhu, B., Cai, Y., and Ren, H. Graph neural topic model with commonsense knowledge. Information Processing & Management 60, 2 (2023), 103215.

[13] Zhu, Q., Feng, Z., and Li, X. GraphBTM: Graph enhanced autoencoded variational inference for biterm topic model. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (2018), pp. 4663–4672.

Appendix

News Data Example
Google courts businesses with ramped up cloud AI Synopsis The internet giant unveiled new AI-powered features for data searches, online collaboration, language translation, images and more at its first annual Cloud Next conference held in-person since 2019. AP Google on Tuesday said it was weaving artificial intelligence (AI) deeper into its cloud offerings as it vies for the business of firms keen to capitalize on the technology. The internet giant unveiled new AI-powered features for data searches, online collaboration, language translation, images and more at its first annual Cloud Next conference held in-person since 2019. Elevate Your Tech Process with High-Value Skill Courses Offering College Course Website Indian School of Business ISB Product Management Visit Indian School of Business ISB Digital Marketing and Analytics Visit Indian School of Business ISB Digital Transformation Visit Indian School of Business ISB Applied Business Analytics Visit The gathering kicked off a day after OpenAI unveiled a business version of ChatGPT as tech companies seek to keep up with Microsoft , which has been ahead in powering its products with AI. "I am incredibly excited to bring so many of our customers and partners together to showcase the amazing innovations we have been working on," Google Cloud chief executive Thomas Kurian said in a blog post. Most companies seeking to adopt AI must turn to the cloud giants -- including Microsoft, AWS and Google -- for the heavy duty computing needs. Those companies in turn partner up with AI developers -- as is the case of a major tie-up between Microsoft and ChatGPT creator OpenAI -- or have developed their own models, as is the case for Google.

71% of SMEs "plan to hire new staff this year," but polarization by company size is deepening

Real name

한세호

Position

Reporter

Bio

Among the many stories the world needs to hear, I will bring you news that connects with readers. I will deliver balanced articles that offer not only information but also entertainment and insight.

Input

2023-06-15 12:52

Modified

2025-02-21 18:25

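Seven out of ten small and medium-sized enterprises (SMEs) plan to hire new staff this year, a survey found. Compared with last year's survey, the share of companies considering hiring fell slightly, but the average planned headcount rose. Demand was highest for production workers in manufacturing, suggesting that polarization is emerging even within the SME job market after the pandemic.

Share of companies planning to hire down 5.6 percentage points from last year

The Korea Federation of SMEs (중소기업중앙회) on the 14th released the results of its "2023 Hiring Trends Survey," conducted in April on 1,031 SMEs listed on the "참 괜찮은 중소기업" platform. In the survey, 71.0% of respondents said they plan to hire new staff. In the same survey last year, 76.6% said they had hiring plans, so the share of companies considering hiring has declined slightly.

Average hires per company, however, moved in the opposite direction. This year's planned hiring averaged 6.6 people, 2.3 more than last year's 4.3. More respondents said they would expand hiring (27.4%) than said they would reduce it (9.7%). Another 62.9% said hiring would be similar to last year.

Meanwhile, 37.6% of companies with hiring plans preferred experienced workers, and a high 41.4% said they require no particular qualifications. In the survey on this year's workforce situation, a majority (55.7%) of SMEs said their staffing level was adequate. The ratio of current employees to required staff averaged 90.9%, up 8 percentage points from the previous year, and the share of companies that had filled 100% or more of their required staff rose to 49.9% from 29.3% last year. This is interpreted as the employment situation gradually recovering as COVID-19 quarantine measures were eased.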

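Polarization in a job market recovering after the pandemic

The job market as a whole is recovering, but polarization is deepening even among SMEs. In particular, the larger the company, the more likely it was to be hiring new employees.

By company size, companies with 300 or more employees had the highest share with hiring plans, at 82.6%. They were followed by companies with 100-299 employees (82.6%), 50-99 employees (74.4%), 10-49 employees (67.4%), and fewer than 10 employees (52.6%). This contrasts with last year's survey, in which the share of companies with hiring plans exceeded 70% regardless of company size.

By job function, production workers had the highest share of hiring plans at 44.7%. Hiring appears to be most active in manufacturing, which had sharply cut employment during the pandemic. This was followed by R&D and production management (32.8%), other (20.8%), and domestic and overseas sales and marketing (20.1%).

Some analysts also say that youth employment support policies by the central and local governments influenced the expansion of SMEs' new hiring. Indeed, this year the government overhauled its youth support programs and expanded the "Youth Job Support Program" to increase youth employment. Representative policies include the "Youth Challenge Support Project" (청년도전지원사업), which pays job-seeking allowances and incentives to encourage young people to find work, and the "Youth Job Leap Incentive" (청년일자리도약장려금) scheme, which encourages companies to hire more young people.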

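Youth unemployment is improving, but mostly through "unstable jobs"

The improvement in the youth unemployment rate has been marked this year. According to Statistics Korea on the 2nd of last month, the unemployment rate for young people aged 15-29 in the first quarter was 6.7% (279,000 unemployed out of 4.17 million economically active young people). This is the lowest first-quarter figure since June 1999, and the rate has improved every quarter since 2021, during the COVID-19 period.

Despite this improvement, however, young people's job security is actually declining. Looking at the industry distribution of employed young people, "accommodation and food services" accounted for a large share of the increase in youth employment in the first quarter. Youth employment there stood at 553,000 in March last year but rose by 90,000 to 643,000 in March this year. By contrast, manufacturing and wholesale and retail trade, regarded as relatively good-quality jobs, fell by 50,000 and 76,000, respectively, from March last year.

Contract lengths also show that job quality is worsening. In March this year, regular young workers with contracts of one year or longer (2,493,000) fell by 45,000 from a year earlier, while young temporary workers with contracts of one month to less than one year (1,068,000) and young daily workers with contracts of less than one month (138,000) rose by 13,000 and 10,000, respectively.

Furthermore, the number of young people who are "just resting," who are not counted in unemployment statistics, is also surging. In the first quarter, the number of young people who described their status as "resting" rose 5.1% year on year to 455,000, a record high for a first quarter. An official at the Korea Labor Institute stressed, "In the past, the resting population had a high share of older people who had retired or had health reasons, but the share of young people is now rising sharply," adding that "more active support appears to be needed for policies to improve the quality of jobs for young people and for employment policy."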

[Notice] How The Economy Korea writes its articles

Real name

TER Editors

Bio

The Economy Research (TER) Editors

Published

2023-02-13 17:32

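The Economy Korea consists of the following domestic media outlets

The Economy Korea: https://kr.giai.org
- Financial Economy: https://financial.economy.ac
- Tech Economy: https://tech.economy.ac
- Bio Economy: https://bio.economy.ac
- Policy Economy: https://policy.economy.ac

The Economy Korea news portal is the umbrella service in Korea for Financial, Tech, Bio, and Policy Economy. The global headquarters, The Economy, is an AI/Data Science-based economic analysis institution, in which the Global Institute of Artificial Intelligence (GIAI) and the global education journal EduTimes handle research and media operations, respectively.

Research projects include economic policy analysis, publication of company rankings by sector, and research on applications of AI/Data Science. The media arm, which began for press and PR purposes, is conducting research to improve the accuracy of translating English content into other languages.

The Korean edition is operated by GIAI's Korean subsidiary (GIAI Korea, https://kr.giai.org) under a content and technology partnership with the global service.

Articles at our domestic outlets are written as follows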

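1. Securing the basic source

We could go out and report in person, but these days press releases are widely distributed. Most press releases, however, show only what their issuers want to show. Here is an example taken from the government's Policy Briefing.

2. Questioning the press release

This press release boasts loudly that Korea's venture industry has produced as many as 22 unicorn companies. Yet among Korea's unicorns there are almost no startups that are truly treated as unicorns by the market because they have real technology or succeeded at a challenge others did not attempt.

The list above is also full of companies that can only invite question marks.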

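3. Article brief

Below is the kind of article brief we normally provide

22 K-unicorns, the most ever? Well??? - 7 unicorns born last year… 22 companies in total, "the most ever" - Policy News | News | Korea Policy Briefing (korea.kr)

Press release summary
└ Yesterday (the 9th) the Ministry of SMEs and Startups released status data saying there are 22 unicorn companies, but there is no substance to it. Let's pick it apart.

Talking Point
1. Many of the listed companies are controversial
└ Yello Mobile is effectively a failed company. Rumor has it that its former CEO, Lee Sang-hyuk, is living in hiding somewhere on Jeju Island
└ TMON was also sold off cheaply to Qoo10 last September for a little over KRW 200 billion, and not even in cash but through a share swap
└ SOCAR is said to have graduated via IPO, but its market cap at yesterday's share price is only KRW 702.6 billion. That is far from the KRW 1 trillion club that defines a unicorn
└ Oasis, scheduled to list early this year, is currently valued at KRW 698.9 billion on Seoul Exchange's unlisted-shares platform (서울거래 비상장). For the companies mentioned, let's search there and add some screenshots
For now I have added one for Oasis

2. Harsh criticism that this is a numbers game that ignores companies' actual circumstances - a point made for a long time
└ https://m.blog.naver.com/ssebiz/221970171173
└ https://www.kcmi.re.kr/publications/pub_detail_view?syear=2020&zcd=002001016&zno=1536&cno=5486

3. Rather than going around promoting such overvaluations, the Ministry should do the opposite: pursue structural reform and help improve the market so that valuations are reasonable - there is talk of labor reform, government reform and so on, but reform of the startup sector is essential too
└ https://www.sedaily.com/NewsView/1Z451UBMWF Weak share prices after listing have happened more than once or twice. Kakao Group affiliates, Krafton, SOCAR, and so on. The Ministry should be leading the way in creating market rules so that wrong valuations cannot be used to fool retail investors, yet it is promoting fake valuations instead, tsk tsk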

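4. Writing after understanding the brief

This is the stage of properly understanding the article brief and writing the article.

"This company is a unicorn? The reality falls far short": Debate over the Ministry of SMEs and Startups' unicorn announcement - Tech Economy

The finished result is as above.

As you will see if you read it, it is an article that can only be written by properly understanding the three points above.

5. Image production

When necessary, images must be produced too. Of course, you are not expected to do all the image work yourself; a designer is assigned.

Request an appropriate image from the design team as above. If all goes well, the result is an article with a suitably prepared image, as below

6. Additional editing

No matter how hard you work on an article, minor problems can arise, such as typos or issues with an image. The editing team then handles them. In addition, if there is a problem with the facts, they also carry out "fact checking."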

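Notes from our hiring experience

Pulling Talking Points like these and picking related articles with explanations is, frankly, a chore. It feels like dashing off a school report, but shouldn't anyone who managed to graduate from university be able to do this level of research on their own? It takes 5 minutes at the quickest and about 20 minutes if done carefully, and 20 minutes is actually enough time for a professional journalist or a brokerage research analyst to write an article or a report. One cannot help thinking: is it really necessary to explain so kindly what to write, when both the salary and the time spent pulling Talking Points feel wasted? Nevertheless, believing that people can only be developed if they are given direction on what material to look up about an event and how to organize their thinking, we let go of staff who at best could write fiction rather than articles, and changed how the Korean subsidiary operates.

Having let the regular reporters go and changed the article-writing system, we hired on the assumption that "surely everyone can do at least this much." Unfortunately, it was hard to find people who could turn even this level of request into a proper news article.

(As of December 2022) We received 88 applications and asked applicants to write the names of the media outlets we operate on their application; more than half got it wrong. One man in his 60s even phoned to say he did not understand what that meant. He claimed 20 years of experience as a reporter. He must have had long experience of outlet names being listed at the bottom of a website.... If you cannot even look up the names of the outlets run by the company you are applying to, how do you expect to do the job?

When we emailed the Talking Points prepared and shared in this way, 13 people actually wrote and submitted an article. Most had shockingly crude sentence construction, but we conceded that they could be used with some training, and after selection 5 remained. We gave them a PDF manual to read over two days of training, sent a OneNote link so it could be viewed on the web, and opened our internal bulletin board where notices and conversations among staff can be looked up, so that they could read and prepare at their own pace.

From the very first day of work, the editing team suddenly got angry that there was an overwhelming amount of article editing to do. Wondering how people who could not construct basic sentences had passed the document screening, we let them go one by one, and in less than a week exactly 3 remained.

Those who could, based on the Talking Points provided, produce text that is not uncomfortable to read within a reasonable time typically amounted to no more than the 3/88 = 3.41% above. This is where people in Korean society who want to "make a living by writing" currently stand. The articles we write are at most a 1-2 page report for a liberal arts general-education class, and it was hard to understand wanting to earn money by writing while being unable to do this.

Some of those who failed came from well-known leading Korean newspapers. One reporter from a certain newspaper even asked whether we were not a newspaper but something like a brokerage research house. We are told that most Korean newspapers do no such research; they more or less copy the press releases companies send, and if that is not enough, they keep the company as a "beat" (출입처) and write what they hear over the phone. That was how the people called "reporters" at Korean newspapers worked.

Reporting on foot? If you cannot even report properly by Googling, how can you call yourself a reporter?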

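To be proud of belonging to an organization, that organization has to be among the global top tier in capability. To be recognized as such, the products it makes must be of global top-tier quality. There are broadly two ways to make a global top-tier knowledge product. One is the way of research papers that could contend for a Nobel Prize: only geniuses can attempt it, and anyone else needs a miracle to write a top paper. The other is for people who are not exceptionally brilliant but have passion and ability to combine their individual strengths through collaboration and division of labor, producing high-quality content comparable to what a single genius would produce. A Nobel Prize may be out of reach through collaboration and division of labor, but that it is quite possible to make a company's high-end products this way has been common sense for humanity since industrialization began in the 1700s.

There is no reason to aim for a global top-tier product, nor to bring in geniuses, for work that amounts to a liberal arts general-education report. We are an organization that runs on the second approach, aiming to raise the level of content through collaboration and division of labor. We lay out the direction of each article in detail under the name Talking Point, there is a design team for graphics, and article editing and even fact checking are in place. We have also developed an IT system that lets writers write comfortably, and the finish of our website design is very high. If you compare the scores of our website with those of the homepages of Naver and Daum, Korea's No. 1 IT companies, on Google PageSpeed (https://pagespeed.web.dev), you can see for yourself how far we have raised our website's quality.

After several years of trial and error, we have completed a production process that can mass-produce polished "articles" as a product. The remaining blank is the work of pulling that support together and writing a "high-quality article." We want to work with people who have the ability to fill that blank and produce high-quality articles, and who want to feel a thrilling sense of achievement in the process.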

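(Added July 2024) We posted openings for in-house article writing and for translating external specialist content. Over one week we received 33 applications in total, but only 5 applicants submitted the assignment that the posting explicitly required. The posting stated that translation would be difficult for anyone without considerable expertise in economic journalism, yet there were many applicants and almost no assignments submitted. Apparently writing articles seemed hard and translation seemed easy, which means they did not read the posting carefully.

Those who submitted the assignment are already several steps ahead for that alone, so we would like to hire them if at all possible. But comprehension aside, we only received assignments whose Korean sentences themselves were awkward. For some postings we added a few pre-screening questions and stated that we would review the application only if they were answered properly, yet many people simply threw in an application without reading properly. For postings where a pre-screening step cannot be added, the assignment submission rate falls below 1/10.

Three behavioral patterns can be seen from the above.

Very many applicants look only at the title of the posting and do not read the details at all
Those who did read did not read properly, so they probably did not even try to read our articles once and gauge the difficulty
They were confident enough to ignore a posting saying that their hard-earned assignment would not be reviewed unless the pre-screening questions were answered properly

About 1 of the 5 who submitted assignments met the minimum bar of being readable without getting angry. This person too will need to study a lot to write articles, and watching them prepare before the final review while looking around our internal system, I came to think it would not be easy. We have clearly documented where and how to find and check things, but it was visible that they kept stalling midway because they could not find them. To write articles you must read and understand a lot of text quickly and pick out the key information right away; will they be able to survive?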

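In web design, a website must be structured very simply so that it can be understood intuitively, and a lot of thought goes into minimizing QA, the work of tracing back incidents where a "fool" comes and makes mistakes. Even with such design, considerable cost is ultimately spent on QA, but people who do not read, look, or listen at all are excluded. It is the same as not debating with people who do not listen. Can someone who wants to earn money by writing, yet cannot write at the level of a professional author and does not read either, really grow? There is no need to bring up the "three manys" (多) theory of reading much, writing much, and thinking much, which says that the first step to writing well is reading many good sentences.

Former reporters complained that we were forcing them to write articles that only top-A-class reporters could write, and then left the company. Since we have no intention of growing an organization that writes articles no different from everyone else's, or rather since we know an organization cannot grow that way, we thought hard about how to raise article quality and built the current division-of-labor system. Talking Points may need to be drawn up by the very top 0.01% of talent in Korea, but the reporters who turn them into prose do not have dazzling credentials. Even so, when we meet people from Korean companies we often hear backhanded compliments such as "you seem like a research institute" or "your staff must be of a very high level." Before the division of labor we were treated as a third-rate tabloid, and seeing their attitude turn 180 degrees, I feel we have barely taken one step. The global headquarters is an AI research institute and an economic research institute, so I also breathe a sigh of relief that we have at least saved face.

In English-language hiring as well, when a posting requires applicants to find a particular word, sentence, or expression somewhere on the website, fewer than 1/10 of applicants from India and the Arab region provide the answer. One might argue that this is because English is not their native language, but among countries where English is not the native language, applicants from the Philippines, Taiwan, and several African countries submit the correct answer almost without exception. Differences between countries in the level of education in reading and understanding sentences are probably reflected in applicants' behavior.

Korea's pay level is at least 4-5 times, and as much as 10 times or more, higher than in the countries mentioned above. With the division-of-labor system in place, the work is also relatively easy. The global team, apparently judging it an efficient system, is making great efforts to benchmark our Korean system. Yet when we explain that we had to compromise on hiring and operations this way because of the reality in Korea, many are surprised. That is probably because of the preconception that Korea has the highest zeal for education in the global market and the most talent relative to its population. If you look back at how large the gap is between that preconception and your own attitude as applicants, I believe it will help you find a workplace that matches your level, even if it is not our company.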

[Notice] The Economy Korea operating policy - Talking Point examples

Real name

TER Editors

Bio

The Economy Research (TER) Editors

Published

2023-01-01 12:00

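Example 1

O1. Naver Series On stops movie downloads to focus on streaming… heading to OTT? - https://news.mt.co.kr/mtview.php?no=2023061416555779230

Opening of the article
└ Lead-in: Shouldn't the headline be changed to say that Naver too is jumping into the OTT market? It seems they decided video downloads do not pay and are going all-in on streaming.

Core of the article: Talking Point
1. Summary of the press release
└ Changing because it differs from how consumers actually buy the service?
└ Downloading is a purchase concept and streaming a rental concept, and price competitiveness is needed to attract consumers

2. In the past, security issues with streaming services were a problem
└ https://pallycon.tistory.com/entry/%EB%84%B7%ED%94%8C%EB%A6%AD%EC%8A%A4%EB%8A%94-%EC%96[…]B4%ED%98%B8%ED%95%98%EB%8A%94%EA%B0%80-%EC%A0%9C1%EB%B6%80
└ Many technologies more advanced than the old ActiveX now exist, so this can be overcome, which is a factor to consider + the OTT-style pattern of consuming via apps has become routine

3. Synergies Naver can expect
└ The growth of Coupang Play, which links OTT and e-commerce
└ What if Naver Store and Series On were linked?
└ https://www.mk.co.kr/premium/life/view/2021/12/31216/ A successful model already exists in Amazon - Amazon + Amazon Prime

Example output:
└ Naver too moves into the OTT market; e-commerce-OTT linkage is growing – OTT Ranking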

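Example 2

P2. Raising regional vitality and responding to regional extinction with a new "population concept"
- Raising regional vitality and responding to regional extinction with a new "population concept" - Policy News | News | Korea Policy Briefing (korea.kr)

Press release summary
└ File download: https://www.mois.go.kr/frt/bbs/type010/commonSelectBoardArticle.do?bbsId=BBSMSTR_000000000008&nttId=100426
└ Lead-in: I am not sure the concept of "living population" (생활인구) can overcome population decline, but as new information it should have some effect. I wonder how it will be measured.

Talking Point
1. Summary of the press release
└ Introducing the concept of "living population"
└ The Ministry of the Interior and Safety does such-and-such

2. What on earth is "living population"?
└ http://firiall.net/wiki/1285
└ http://www.chunsa.kr/news/articleView.html?idxno=54871

3. Can increasing the living population counter regional extinction? OK, but to increase the living population you need infrastructure and jobs there. Who invests in the infrastructure?
└ Regional case http://www.sisaweek.com/news/articleView.html?idxno=202215 - In the end the government pours in money; only the wording has changed, and it means building infrastructure in the regions to induce population movement
└ Overseas case https://www.unipress.co.kr/news/articleView.html?idxno=7361 - The multiple-address system looks somewhat meaningful

Example output:
└ Population decline and regional extinction crisis: seeking a breakthrough through "living population" – Policy Economy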

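Example 3

V1. Will the job market, depressed by COVID-19, recover?… 71% of SMEs "plan to hire new staff this year"
- Will the job market, depressed by COVID-19, recover?… 71% of SMEs "plan to hire new staff this year" - ETNews (etnews.com)

Press release summary
└ Press release: https://www.kbiz.or.kr/ko/contents/bbs/view.do?seq=154672&mnSeq=207
└ Lead-in: With all the talk about how bad the economy is, many SMEs still plan to hire new staff?

Talking Point
1. Summary of the press release
└ Overall it does not look much different from the survey published every year, but expressing a willingness to hire despite the poor economy looks positive

2. The government, for its part, offers various youth hiring support programs
└ https://www.korea.kr/news/policyNewsView.do?newsId=148910316
└ The conditions are things like being a university graduate, under 34, and working at least 30 hours a week; someone will get the support, but the conditions are not easy to meet.

3. Meanwhile, there is a lot of talk lately about youth unemployment falling
└ https://www.hani.co.kr/arti/economy/economy_general/1090319.html
└ The reason is that youth employment in accommodation and food services rose by 90,000. The number of young people simply resting has also risen enormously...
└ It is probably because everyone thinks they will only join a large company and will retry for a job in the meantime. Having tried hiring myself, I too ended up thinking that rather than be stressed by below-standard staff, I would rather not hire at all...

Example output:
└ 71% of SMEs "plan to hire new staff this year," but polarization by company size is deepening – Tech Economy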

예시4

P5. 서울시, 서울백병원 도시계획시설(종합의료시설) 결정 추진 검토
보도자료 요약
ㄴ파일 다운로드:https://seoulboard.seoul.go.kr/comm/getFile?srvcId=BBSTY1&upperNo=390770&fileTy=ATTACH&fileNo=2&bbsNo=158
ㄴLead-in: 인제대학교가 만성적자인 서울백병원 접고 서울 한 가운데에 있는 땅에서 다른 사업할려고 하는 것 같던데, 그거 막으려는 정책으로 보입니다

Talking Point
1.인제대학교가 서울백병원 접겠다~ 고 선언한 내용 관련
ㄴhttps://www.hani.co.kr/arti/society/health/1094726.html
ㄴ다른 대학 병원들 접을 때는 말 없더니 백병원은 왜 이렇게 딴지를 거는 걸까요ㅋㅋ

조금 더 배경을 추가하면, 대학들은 교육용 재산과 수익용 재산을 보유해야 하고, 교육용 재산은 애들 가르치는데 쓰이는 건물, 운동장, 도서관 같은 것들, 수익용 재산은 재단 운영비를 애들 등록금에서 충당하지 말고 너네 수익성 재산으로 충당해라, 등록금은 오직 애들 교육 목적으로만 써야 한다는 이유에서 구분이 되어 있습니다.근데, 우리나라 대학들 중에 수익용 재산을 교육부 요건대로 갖고 있는 곳들이 거의 없습니다.

4년제 대학은 300억인데, 몇 군데가 그걸 갖고 있으려나요? 심지어 대부분은 어디 산골에 있는 산비탈 같은거에요. 재산상의 의미가 없는 것들이 대부분이죠. 인제대학교도 서울 도심 한 가운데에 있는 땅에서 수익도 안 나오는 병원을 계속 갖고 있을 이유가 없으니까 저렇게 정리해버리고 수익용 재산으로 변경하겠다는건데 (실제 속셈은 잘 모르겠습니다만...)

2.서울시가 도시계획시설로 지정하겠다
ㄴ좀 전에 나온 서울시 보도자료 내용 입니다 (파일 다운로드 참조)
ㄴ읽어보면 알겠지만, 수익용 재산으로 못 바꾸도록 도시계획시설 -(https://m.blog.naver.com/seog11111/221373718301) 중 보건위생시설로 강제로 지정해버리겠다는겁니다.
ㄴ이렇게 지정되면 빼박 무조건 여기서 병원해야지 다른 사업을 할 수가 없게 됩니다. (저희 회사도 지방에 있는 땅이 흑흑흑)
ㄴ백병원 너네 사업 접겠다고? 어쭈? 엿 먹어라~ 이거죠

3.서로 타협안을 찾아야지, 이렇게 정부가 엿먹어라로 나오면 안 되겠죠?
ㄴ백병원이 가지는 시민사회 기여분을 감안해서 서울시-중구청이 적자 보전을 비롯해서 이런저런 지원을 해주는 방식으로 풀어내야 한다는 이야기 씁시다
ㄴ20년간 누적 적자가 1,745억원이라는데 누가 기분 좋아서 계속 운영하겠습니까
ㄴhttp://www.monews.co.kr/news/articleView.html?idxno=323819 // http://www.docdocdoc.co.kr/news/articleView.html?idxno=3006874 Complaints like these are entirely understandable, but shouldn't civil society give the hospital some help in return so that both sides can coexist?

작성 결과물 예시:
ㄴ '적자 그만' 폐원안 내놓은 서울백병원, 서울시 '도시계획시설 결정' 초강수 – Policy Economy

Picture

Member for

8 months 2 weeks

Real name

TER Editors

Bio

The Economy Research (TER) Editors

경제학자들이 알아야하는 ML, DL, RL 방법론

Picture

Member for

8 months 2 weeks

Real name

Keith Lee

Bio

Professor of AI/Data Science @SIAI
Senior Research Fellow @GIAI Council
Head of GIAI Asia

Input

2021-06-28 00:00

아직까지 머신러닝, 딥러닝, 인공지능 같은 단어들이 보고서를 통과시켜주는 마법의 단어인 2류 시장 대한민국과 달리, 미국, 서유럽에서는 이런 계산과학 방법론을 다른 학문들이 어떻게 받아들여야하는지 이미 한번의 웨이브가 지나가고, 어떤 방식으로 쓰는게 합리적인지 내부 토론으로 정리가 되어 있다.

출신이 경제학이라 석사 이후로 발을 뺀지 오래되었음에도 불구하고 습관처럼 유명한 경제학자들 웹페이지에 올라온 Working paper나 기고를 훑어보는데, 오늘은 경제학에서 ML 방법론을 어떻게 받아들이고 쓰고 있는지에 대한 정리글을 소개한다.

Machine Learning Methods Economists Should Know About

참고로 원 글의 저자는 Stanford 경제학 교수 2명이고, 글이 외부에 공개된 시점은 2019년 3월이다. (대략 2017-2018년에 이미 논의가 정리되었었다고 봐도 된다.) - 글 링크

Model-based vs. Algorithmic Approach

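People who do traditional statistical modeling mostly take a model-based approach, whereas (some of) those who work in computational science approach data believing that an algorithm can uncover the relationships in the data even when no model is fixed at the outset.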

어느 쪽이건 실제 데이터가 갖고 있는 숨겨진 구조를 찾아내고, 그 구조를 미래 예측이나 자신의 문제를 해결하는 용도로 쓰려고 한다는 "Listen to Data"라는 최종 목적지는 동일하지만, 출발점을 어디로 두느냐가 다를 뿐이다.

그간 파비클래스 강의에서부터 여러 경로로 꾸준히 반복해왔던 말이기도 한데, 데이터의 실제 구조를 어느 정도 예측할 수 있다면, 계산비용을 과다하게 지불하면서 적절한 모델을 찾아줄 것이라는 막연한 기대를 갖고 접근할 필요없이, 알고 있는 모델을 바탕으로 데이터를 활용하면 된다.

가장 단순한 계산이 OLS 같은 선형 계산법이고, 그 외에도 데이터의 분포함수를 알고 있다면 쓸 수 있는 Maximum Likelihood Estimation (MLE), 혹은 데이터가 반드시 충족해야하는 Expectation (ex. E(x) = 1)을 활용하는 Method of Moments Estimation (MME) 등의 계산법이 있다.

데이터가 정규 분포를 따르고 있지 않으면 OLS = MLE가 깨지면서 MLE가 우월한 계산법이 되고, 데이터의 입력 변수가 2개 이상일 경우에 (Decision Theory 논리에 따르면) 일반적으로 MME가 OLS보다 우월한 계산법이다.
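
A minimal sketch of the OLS versus MLE point, using a hypothetical one-parameter location model $y_i = \mu + \varepsilon_i$ (the model, sample size, and function names are assumptions chosen for illustration, not taken from the post). With normal errors the least-squares estimate, the sample mean, is also the MLE. With Laplace (double-exponential) errors the MLE becomes the sample median, so the two estimators separate.

```python
import random
import statistics


def ols_location(y):
    # Least squares for y_i = mu + e_i: the minimiser of sum((y_i - mu)^2) is the sample mean.
    return statistics.fmean(y)


def laplace_mle_location(y):
    # With Laplace errors the log-likelihood is -sum(|y_i - mu|) / b + const,
    # so the MLE minimises the sum of absolute deviations: the sample median.
    return statistics.median(y)


def sum_sq(y, mu):
    return sum((v - mu) ** 2 for v in y)


def sum_abs(y, mu):
    return sum(abs(v - mu) for v in y)


random.seed(0)
# Laplace(0, 1) noise generated as the difference of two Exp(1) draws; true mu = 2.
y = [2.0 + random.expovariate(1.0) - random.expovariate(1.0) for _ in range(501)]

mu_ols = ols_location(y)
mu_mle = laplace_mle_location(y)
print(f"OLS (mean): {mu_ols:.3f}  Laplace MLE (median): {mu_mle:.3f}")
```

Each estimator is optimal only for its own loss: the mean minimises squared error, the median minimises absolute error, and which loss is the likelihood depends on the error distribution.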

그런데, 데이터의 분포함수도 모르고, Expectation도 확신이 없다면?

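Now suppose the data you collected has no problems such as endogeneity, so it does not need preprocessing with instrumental variables (IV) or the like, and you still have to find the pattern inside it. And suppose you are very confident that the relationship is not linear (e.g. image recognition, natural language processing)?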

이 때 Algorithmic approach가 엄청난 파워를 발휘할 수 있다.

기존의 OLS, MLE, MME 등의 통계학 계산법들이 못 찾아냈던 패턴을 찾아내주니까.

통계학에서는 "Need to move away from exclusive dependence on data models, and adopt a more diverse set of tools"라는 표현으로 Algorithmic approach를 반긴다.

단, 언제 쓴다? "Listen to Data"를 해야되니까, Data의 구조상 Algorithmic approach가 필요한 경우에만!

(데이터만 있으면 무조건 Algorithmic approach 중 가장 많이 알려진 Deep Neural Net만 쓴다는 사람들에게 바치는 문장이다.)

왜 경제학계에서는 Algorithmic approach를 안 or 늦게 받아들였나?

첫째, 경제학, 특히 계량경제학 하는 사람들은 수학적인 Formal Property를 너무 좋아한다. 수학적으로 딱 떨어지는 결과, ex. consistency, efficiency, normality, 값이 없으면 그 논문은 발표 자리에 한번 나갈 기회 얻기가 힘들다. ML 쪽에서 DNN이 항상, 언제나, 무조건 Random forest보다 우월하다는 증명이 가능할까? 아직까지 된 적도 없고, Empirical test는 안 된다는 걸 보여주고, 무엇보다 어느 모델이 다른 모델보다 Universal하게 우월할 수 없다는, 데이터에 따라 적절한 모델은 달라질 수 밖에 없다는 인식은 ML 연구자들이 공통적으로 갖고 있는 인식이다. (비전문가인 국내 개발자 집단만 DNN이 무조건 제일 좋은 줄 안다.) 그러니까 더더욱 Formal Property 좋아하는 사람들이 싫어하겠지.

둘째, 결과값의 정확도를 검증하는 방법이 1차원적이기 때문이다. 통계학 방법론들은 분산을 찾고, t-test를 위시한 평균-분산 구조에서 결과값의 검증이 가능하다. 1st moment인 평균만 쓰는게 아니라, 2nd moment인 분산까지 쓰니까, 분포함수가 정규분포라면 확정적인 결론을 얻을 수 있고, 그 외의 데이터라고해도 해당 분포함수 기반의 t-test 값이 있다면 신뢰구간에 대해 높은 확신을 가질 수 있다. 반면, ML 방법론들로는 분산을 찾는다는게 수만번 비슷한 계산을 돌려서 각각 다른 1st moment가 나오는걸 보는 방법 이외에 달리 합리적으로 분산을 얻어낼 수가 없다. 그러니까 training set, test set으로 데이터를 분리한 다음, test set에서의 정확도를 쳐다보는, 신뢰구간을 구할 수 없는 계산법에 의존하는 것이다. 이쪽에서는 Beta hat을 구하는게 아니라 Y hat을 구하고, Y hat과 실제값의 차이만 본다. 상황이 이렇다보니, 결과값의 Robustness에 논문 쓰는 능력을 검증받는 경제학계에서 ML 방법론을 쓴다는 것은, 자신의 논문이 Robustness 검증을 안 했다는 걸 스스로 인정하는 꼴이 되기 때문에, ML 방법론을 알아도 쓸 수가 없는 것이다.

셋째 이후는 배경 지식이 좀 (많이) 필요한 관계로 글 마지막에 추가한다.

약간 개인 의견을 추가하면, Algorithmic approach 중 하나로 활용 가능한 Network theory를 이용해 연구를 하던 무렵 (Network은 행렬로 정리했을 때 같은 Network이어도 눈에 보이는 Representation은 얼마든지 달라질 수 있다 - Isomorphism 참조), 이런 Network이 얼마나 Robust한 설명인지를 따지려면 여러가지 경우의 수를 놓고 봐야할텐데, 모델이 완전히 달라질 것 같고, 아니면 아예 못 푸는 문제가 될 것 같은데, 과연 논문을 Publish하는게 가능하겠냐는 우려 섞인 걱정을 해준 분이 있었을 정도였다.

요즘 DNN에서 Node-Link 구조가 조금만 바뀌어도 모델이 완전히 바뀌는데, 거기다 데이터만 바뀌어도 Link값들이 크게 변하는데 과연 믿고 쓸 수 있느냐는 질문이 나오는데, 이런 질문이 1990년대 후반, 2000년대 초반에 Neural Network에 Boltzmann 스타일의 Gibbs sampling + Factor Analysis 접근이 시도되었을 때도 나왔던 질문이다. 현재까지 Boltzmann 구조보다 더 효과적인 Network의 Layer간 Link 값 계산을 정리해주는 계산법이 없으니 현재도 유효한 질문이고, 사실 Network이라는 구조 그 자체가 이런 "코에 붙이면 코걸이, 귀에 붙이면 귀걸이"라는 반박을 이겨내기 힘든 구조를 갖고 있기도 하다.

Ensemble Methods vs. Model Averaging

무조건 Algorithmic approach를 피했던 것은 아니고, 실제로 Algorithmic approach라고 생각하는 계산법들을 경제학계에서 이용한 사례도 많다. 대표적인 경우가 ML에서 쓰는 Ensemble 모델과 경제학에서 흔히 쓰는 Model Averaging 방법이다.

예를 들어, Random Forest, Neural Network, LASSO를 결합하는 Stacking 계열의 Ensemble을 진행한다고 생각해보자. 이걸 Model Averaging이 익숙한 계량경제학의 관점으로 다시 표현하면,

$$
(\hat{p}^{RF}, \hat{p}^{NN}, \hat{p}^{LASSO}) = \underset{p^{RF},\, p^{NN},\, p^{LASSO}}{\arg\min} \sum_{i=1}^{N^{test}} \left( Y_i - p^{RF} \hat{Y}_i^{RF} - p^{NN} \hat{Y}_i^{NN} - p^{LASSO} \hat{Y}_i^{LASSO} \right)^2
$$

$$
\text{subject to } p^{RF} + p^{NN} + p^{LASSO} = 1, \text{ and } p^{RF}, p^{NN}, p^{LASSO} \geq 0
$$

이라고 쓸 수 있다.

원래의 Y값을 가장 잘 설명하는 모델을 찾고 싶은데, 3개 모델의 가중치 합계가 1이 된다는 조건 (& 양수 조건) 아래, 셋 중 어떤 모델을 써서 오차를 최소화하는지에 맞춘 최적화 계산을 하는 것이다.

(아마 일반 유저들이 활용하는 Stacking Library도 위의 방식으로 최적화 계산이 돌아가고 있을 것이다.)
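
The constrained least-squares problem above can be sketched in a few lines. This is an illustration under stated assumptions, not the way any particular stacking library works: the three prediction series are synthetic, the function name `stacking_weights` is hypothetical, and a coarse grid search over the simplex stands in for a proper quadratic-programming solver.

```python
import itertools
import random


def stacking_weights(y, preds, step=0.05):
    """Grid-search non-negative weights summing to 1 that minimise squared error.

    y     : observed outcomes on the held-out set
    preds : three lists of predictions, one per base model
    """
    n_steps = round(1 / step)
    best_w, best_loss = None, float("inf")
    for a, b in itertools.product(range(n_steps + 1), repeat=2):
        if a + b > n_steps:
            continue  # the third weight would be negative
        w = (a * step, b * step, (n_steps - a - b) * step)
        loss = sum(
            (yi - sum(wk * p[i] for wk, p in zip(w, preds))) ** 2
            for i, yi in enumerate(y)
        )
        if loss < best_loss:
            best_w, best_loss = w, loss
    return best_w, best_loss


random.seed(1)
truth = [random.gauss(0, 1) for _ in range(200)]
# Three hypothetical base models with different noise levels around the truth.
pred_rf = [t + random.gauss(0, 0.5) for t in truth]
pred_nn = [t + random.gauss(0, 0.8) for t in truth]
pred_lasso = [t + random.gauss(0, 0.3) for t in truth]

weights, loss = stacking_weights(truth, [pred_rf, pred_nn, pred_lasso])
print("weights (RF, NN, LASSO):", weights)
```

Because the corner points (1, 0, 0), (0, 1, 0), and (0, 0, 1) are on the grid, the stacked fit can never be worse on this sample than the best single model.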

단순히 위의 3개 ML 계산법 뿐만 아니라, MLE, MME, OLS 등등의 통계학 계산법을 활용할 수도 있고, 어떤 계산법이건 합리적이라고 판단되는 계산법들을 모아서 Model Averaging을 하고 있으면, Ensemble과 이론적으로, 실제로도 동일한 계산이 된다.

단, 합리적이라고 판단할 수 있는 계산이 경제학에서는 Bias-Variance trade off를 놓고 볼 때, Bias가 없는 쪽만 따지는게 아니라, Confidence interval (또는 Inference)도 중요하게 생각하는 반면, ML에서는 분산 값 자체가 없으니까 철저하게 Out-of-sample performance, 즉 Bias가 없는 쪽에만 집중한다.

그래서 Stacking 또는 Model Averaging에 넣는 후보 계산법들도 달라질 수 있고, 결과값의 Inference에 대한 요구치도 다르다.

독자들의 이해를 돕기 위해 약간의 개인 견해를 덧붙이면, 선거 여론조사 여러개를 평균해서 가장 실제에 가까운 값을 찾는다고 했을 때, ML 방법론을 쓰는 사람들은 1,000명이건, 500명이건, 10,000명이건, 몇 명에게 물었건 상관없이 평균값 = 실제값으로 일단 가정하고, 그 값 근처에 있는 여론조사를 우선 갖다 쓰고, 틀렸으면 다른 여론조사로 갈아 끼운다는 관점이라고 볼 수 있다. 반면 경제학 방법론을 쓰는 사람들은 500명이면 분산이 너무 크기 때문에, 분산이 큰 경우에는 가중치를 낮게 주고, 분산이 작은 경우에 가중치를 높게 준 다음 가중 평균을 해서 기대값을 구하고, 그 때 +- x.y% 라는 신뢰구간을 꼭 붙여야된다고 생각하는 것이다.
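
The "econometric" side of the poll example, weighting each poll by the inverse of its variance and attaching a confidence interval, can be sketched as follows. The three polls and the function name `pooled_estimate` are hypothetical, and the variance of each poll is assumed to be the simple binomial $p(1-p)/n$.

```python
import math


def pooled_estimate(polls):
    """Inverse-variance weighted average of poll shares.

    polls: list of (share, sample_size) with share in [0, 1].
    Returns (estimate, standard_error, weights).
    """
    # Binomial sampling variance of each poll: p(1-p)/n.
    variances = [p * (1 - p) / n for p, n in polls]
    raw = [1 / v for v in variances]
    total = sum(raw)
    weights = [r / total for r in raw]
    estimate = sum(w * p for w, (p, _) in zip(weights, polls))
    std_err = math.sqrt(1 / total)
    return estimate, std_err, weights


# Hypothetical polls: (support share, respondents).
polls = [(0.48, 500), (0.52, 1000), (0.50, 10000)]
est, se, w = pooled_estimate(polls)
simple_mean = sum(p for p, _ in polls) / len(polls)
print(f"simple mean  : {simple_mean:.4f}")
print(f"weighted mean: {est:.4f} +- {1.96 * se:.4f} (95% interval)")
```

The 500-person poll receives the smallest weight, and the pooled standard error is smaller than that of the largest single poll.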

어차피 신뢰구간 그거 누가보는거냐고 생각할 수도 있고, 신뢰구간이 +- 20% 이렇게 터무니없게 나오면, 아무리 여러 여론조사를 모아서 평균값을 썼다고해도, 그 숫자를 누가 믿고 선거 결과 예측에 쓰냐는 반박을 할 수도 있다.

파비클래스 수업 시간에도 항상 강조하는 내용이지만, Ensemble / Stacking / Model Averaging 그 어떤 단어를 쓰건 상관없이, 기본 모델 N개를 결합할 때는 계산의 오차 (Bias)가 작은 경우만 집중할게 아니라, 믿을 수 있냐 (Variance)는 질문에 답이 나오는 모델들을 결합해야 된다고 지적한다. 이름을 어떻게 붙여서 어느 학문에서 쓰고 있건 상관없이, 수학적인 Property는 어차피 같은데, 결과값을 내가 쓸 수 있느냐 없느냐가 바로 "Listen to Data"를 제대로 했는지 아닌지에 따라 결정되기 때문이다. 모델의 Variance가 크다는 말은 Listen to Data를 하지 않은 모델이라는 뜻이니까. (혹은 너무 샘플 데이터만 곧이곧대로 믿었다는 이야기니까.)

Decision Tree vs. Regression Tree

ML 계산법을 처음 보는 사람들은 Decision Tree라는게 Regression보다 압도적으로 우월한 계산 아니냐는 질문을 하는 경우가 종종 있다. 근데, 기본형 Tree도, 확장버전인 Random Forest도 모두 UC Berkeley 통계학자가 1984년, 2001년에 쓴 논문에 정리되어 있는 계산법들이다. 정리되기 오래 전부터 이미 다들 알고 있는 계산법이기도 했고.

위에서 보듯이, Regression에 기반한 모델을 여러개 Regression으로 구분하도록 구간별 평균값을 다르게 잡는게 Regression tree의 시작점이다.

역시 파비클래스 강의에서 계속 설명해왔던 내용인데, 여러 구간에 나눠서 Regression하는게 의미가 있는 경우(ex. 약에 반응하는 몸무게 구간이 여러개 나뉘어 있다는 가정)에만 Tree 계열의 모델이 의미가 있다. 예를 들어, c보다 작은 구간에서는 Regression이 별로 효과가 없는 반면, c보다 큰 구간에서는 Regression으로 특정 변수간 유의미한 관계가 두드러지게 나타날 수 있다.
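
As a sketch of the threshold idea, the simplest regression tree is a single split that assigns a different mean to each side of a cut point c. The data below are hypothetical (no response below c = 5, a constant response above it), and `best_split` is an illustrative helper, not a library function.

```python
def best_split(x, y):
    """Return (threshold, left_mean, right_mean) minimising squared error
    for a single split on one variable (a depth-1 regression tree)."""
    pairs = sorted(zip(x, y))
    xs = [p[0] for p in pairs]
    ys = [p[1] for p in pairs]
    best = None
    for i in range(1, len(xs)):
        if xs[i] == xs[i - 1]:
            continue  # no threshold fits between identical x values
        left, right = ys[:i], ys[i:]
        m_left = sum(left) / len(left)
        m_right = sum(right) / len(right)
        sse = sum((v - m_left) ** 2 for v in left) + sum((v - m_right) ** 2 for v in right)
        if best is None or sse < best[0]:
            best = (sse, (xs[i - 1] + xs[i]) / 2, m_left, m_right)
    return best[1], best[2], best[3]


# Hypothetical dose-response data: no effect below c = 5, constant effect of 3 above it.
x = [i / 10 for i in range(100)]
y = [0.0 if xi < 5 else 3.0 for xi in x]
threshold, mean_below, mean_above = best_split(x, y)
print(threshold, mean_below, mean_above)
```

A full tree repeats this search inside each resulting interval, and a regression tree in the wider sense can fit a regression, rather than a constant, on each side.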

Decision Tree라고 외부에 알려진 모델은 Y와 Y평균값 차이를 1개 변수에 한정해서 여러 스텝으로 반복하고, 구간을 나눌 때 0/1 형태로 구분하는 Step function을 Kernel로 활용하는 Regression Tree의 특수형태 중 하나다. 일반적으로 Regression Tree라는 명칭은 1 -> N개 변수에 대응할 수 있는 일반형 Tree 모델을 오랫동안 통계학에서 불러왔던 명칭이다. (참고로 이 모델을 중첩형으로 쌓으면 Neural Network가 된다)

위의 식에 Alpha값이라는 모델별 가중치에서 보듯이, Random Forest란 그런 여러 Tree 모델들에 각각 얼마만큼의 가중치를 배분해주느냐, 그래서 Stacking을 어떻게 하느냐는 계산이다. 차이가 있다면, Tree가 진화하는 구조 속에 데이터에서 알려주는대로 가중치를 나눠 배분하면서 구간을 쪼개가기 때문에, 좀 더 복잡한 구조를 가진 데이터일 경우에 적합한 모델을 얻을 가능성이 높아진다.

이해도를 높이기 위해 복잡한 구조를 가진 데이터의 예시를 하나만 들어보자. 몸무게 특정 구간 A, B, C, D, E 그룹 중 B와 D 그룹에서만 반응하는 약물이라고 생각하면, A, C, E 그룹과 데이터가 혼재된 상태에서의 Regression보다 구간을 여럿으로 쪼갤 수 있는 Tree가 더 효율적인 계산이고, 그런 구조가 단순히 몸무게 하나에서만 나타나는게 아니라, 키, 팔 길이, 다리 길이 등등의 다양한 신체 구성 요소의 범위에 제각각으로 영향을 받는다면, 이걸 Regression 하나로 찾아낸다는 것은 데이터 구조에 맞지 않는 계산이다. Tree로 모델을 만들고, 다양한 샘플에서 비슷하게 계속 맞아들어갈 수 있는 모델을 찾겠다면, Decision Tree 하나만 찾고 끝나는게 아니라, Random Forest를 이용해 여러 모델을 Model averaging하는게 적절한 계산법이다.

결론이 팔 길이 40cm - 45cm, 다리 길이 80cm - 85cm, 키 175cm - 180cm 구간과 각 값이 20%씩 더 뛴 구간에서만 약물이 효과가 있고, 그 외에 나머지 구간에서는 아예 효과가 없다면? 각 값이 10% 작은 구간과 10% 큰 구간에서는 아무런 효과가 없었다면? 일반적인 Regression은 그 약물이 그다지 효과가 없다고 결론 내리겠지만, Regression을 Tree를 이용해 구간으로 나눠보면 위의 특정 2개 구간에서만 두드러진 효과가 있음을 좀 더 쉽게 찾아낼 수 있다.
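
A small numerical sketch of this situation, with made-up data in which a treatment works only inside two bands of a single covariate. The band positions and helper names are assumptions for illustration. A single regression line reports essentially no effect, while interval means expose the two bands.

```python
def ols_slope(x, y):
    # Slope of the least-squares line y = a + b*x.
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    return sxy / sxx


def bin_means(x, y, width):
    # Average response inside consecutive intervals [k*width, (k+1)*width).
    groups = {}
    for xi, yi in zip(x, y):
        groups.setdefault(int(xi // width), []).append(yi)
    return {k: sum(v) / len(v) for k, v in sorted(groups.items())}


# Hypothetical data: the drug works only in two bands of the covariate.
x = list(range(100))
y = [1.0 if 20 <= xi < 30 or 70 <= xi < 80 else 0.0 for xi in x]

slope = ols_slope(x, y)
means = bin_means(x, y, width=10)
print(f"OLS slope: {slope:.6f}")
print("mean effect per interval:", means)
```

Here the intervals are fixed by hand. A tree would choose the cut points from the data, which is what makes it useful when the bands are unknown.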

이런 구간별 효과는 Monotonic increase/decrease를 가정하는 기존 Regression 모델로 풀어내는데 한계가 있으니, 구간을 하나하나 다 뒤져보겠다는 관점에서 Algorithmic approach를 통해 (More specifically, Tree 모델을 통해) 그런 구간을 찾아낼 수 있도록 컴퓨터에 의존하는 것이다. (다만 Monotonic이 깨지는 경우가 그렇게 일반적이지는 않다. 팔 길이가 40cm, 50cm, 60cm +-1cm 인 구간에서만 효과가 있고, 나머지 팔 길이에서는 효과가 없는 약물이 과연 얼마나 될까?)

Neural Network vs. Factor Analysis

K개의 변수 X가 있다고 가정해보자. 그 중 실제로 숨겨진 변수 (Latent / Unobserved variable)인 Z는 총 K_1개가 있다고 하면,

Sigmoid 함수를 Kernel, 또는 (ML쪽 용어로) Activation Function으로 쓴다고 했을 때, 첫번째 Hidden Layer를 바로 위의 식으로 정리할 수 있다.

위에서 Beta는 ML에서 이야기하는 가중치이고, g(.)는 Activation function, K는 입력하는 변수의 숫자, K_1은 Node의 숫자, Z는 숨겨진 변수, epsilon은 회귀분석에서 말하는 오차항이다.
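
The formulas these definitions refer to are not reproduced in the text. One way to write a single hidden layer that is consistent with the symbols defined above is given below. It is offered as an assumed notation, not as the original equations: the superscripts $(1)$ and $(2)$ and the indices $i$ (observation), $j$ (input variable), and $k$ (node) are introduced here for illustration.

$$
Z_{ik}^{(1)} = \sum_{j=1}^{K} \beta_{kj}^{(1)} X_{ij}, \qquad k = 1, \dots, K_1
$$

$$
Y_i = \sum_{k=1}^{K_1} \beta_{k}^{(2)} \, g\!\left( Z_{ik}^{(1)} \right) + \varepsilon_i, \qquad g(z) = \frac{1}{1 + e^{-z}}
$$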

같은 논리로 Hidden Layer 1에서 Hidden Layer 2로 가는 식을 세울 수도 있다.

이런 식이 반복되는 구조가 Neural Network로 알려진 계산법인데, 파비클래스에서 설명해왔던대로, Activation Function을 단순 선형 함수로 쓰는 경우는 Linear Factor Analysis이고, 비선형 함수를 쓰는 경우는 Non-linear Factor Analysis이다. Factor Analysis와 동치인 이유는 Hidden Layer라고 부르는 곳에 있는 Node가 모두 숨겨진 변수 (Latent / Unobserved variable)이라는, 전형적인 Factor Analysis 계산의 결과값이기 때문이다. 숨겨진 변수를 정확하게 특정할 수 없기 때문에, FA 계산은 많은 경우에 "코에 걸면 코걸이, 귀에 걸면 귀걸이"라는 비난을 받는다. 글 앞 부분에 Network 모델이 가진 한계를 지적하던 부분과 일맥 상통한다.

정규분포의 합과 차는 정규분포이기 때문에, 입력 데이터가 정규분포인 경우에 출력값도 정규분포라고 가정한다면, 단순한 Linear Factor Analysis로 충분한 계산이다. 말을 바꾸면, Neural Network라는 계산이 필요한 데이터 프로세스는 입,출력 데이터가 모두 정규분포가 아닌 경우에 제한된다. Non-linear Factor Analysis가 필요하다는 뜻이기 때문이다.

같은 맥락에서 Deep Neural Network가 필요한 경우는, 여러번의 Factor Analysis가 반복되어야 하는 계산인 경우인데, 위의 정규분포 -> 정규분포 구조에서는 의미가 없다. 정규분포의 합과 차는 계속해서 정규분포를 결과값으로 내보낼 것이기 때문이다. DNN이라는 계산법이, 데이터가 위상구조를 띄고 있어서 Factor를 단번에 찾아내는게 어려운 구조, 그래서 여러 번의 Factor Analysis를 반복해서 위상구조의 깊숙한 곳을 찾아가야 원하는 데이터의 숨겨진 구조를 찾아낼 수 있는 경우에만 필요한 계산법이라는 것이 바로 이런 맥락이다.
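
One piece of this argument is easy to check numerically: when every activation function is linear, stacking layers adds nothing, because the composition of linear maps is itself a single linear map. The weights below are arbitrary hypothetical values and the helpers are illustrative.

```python
def matmul(a, b):
    # Plain-Python matrix product of two lists of rows.
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def linear_layer(weights, inputs):
    # A layer with identity (linear) activation: outputs = weights @ inputs.
    return matmul(weights, inputs)


# Hypothetical weights: 3 inputs -> 2 hidden nodes -> 2 hidden nodes -> 1 output.
W1 = [[0.5, -1.0, 2.0], [1.5, 0.0, -0.5]]
W2 = [[1.0, 2.0], [-1.0, 0.5]]
W3 = [[2.0, -1.0]]
x = [[1.0], [2.0], [3.0]]  # one observation as a column vector

deep_output = linear_layer(W3, linear_layer(W2, linear_layer(W1, x)))
collapsed = matmul(W3, matmul(W2, W1))  # a single 1x3 weight matrix
shallow_output = matmul(collapsed, x)
print(deep_output, shallow_output)
```

The three-layer "deep" network and the single collapsed layer give the same output, so depth only matters once the activation is non-linear.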

경제학에서 Algorithmic approach를 안 쓴 셋째, 넷째 이유

위의 지식이 갖춰졌으니 경제학계에서 Algorithmic approach를 왜 안 다뤘을까에 대한 이유를 추가하면,

셋째, 경제학의 많은 데이터들이 Non-linear 패턴이나 구간별 효과값이 다른 경우가 거의 없기 때문이었다. 대부분의 X -> Y 관계는 Monotonic increase/decrease 관계를 갖고 있고, 그 패턴이 Non-linear하다고해도 Log값 기준으로 변화율간 관계, 특정 구간 (Equilibrium 근처)에서의 움직임을 보고 있으면 non-linearity가 대부분 제거된 구조를 보는 경우가 대부분이다. 위에서 보듯이, 대부분의 ML 방법론들이 기존의 통계학을 "단순히 다르게" 쓰는 계산법들이라 계량경제학을 하는 사람들이 몰랐을리가 없었음에도 불구하고 이용하지 않았던 가장 궁극적인 이유다. 필요가 없었으니까. 다루는 데이터가 달라지거나, 목적이 달라지는 경우에만 눈을 돌리겠지.

넷째, 사회과학 데이터는 Endogeneity가 있는 경우가 많기 때문에, Simultaneity, Mis-specification, Measurement error 등등을 다뤄주거나, Time series에서 Endogeneity 같은 모델 구조적인 문제가 아니어도 Noise를 제거해줘야하는 경우를 먼저 고민한다. 일단 알려진 or 짐작할 수 있는 문제를 제거하지 않고 데이터 작업에 들어가면 학자 자격을 의심 받는다. 반면, Algorithm approach는 그런 데이터 전처리를 깊게 고민하지 않고도 데이터 속의 패턴을 찾아낼 수 있다는 관점에서 활용되는 계산법인데, 데이터에 Endogeneity를 비롯한 모델 구조적인 문제 및 각종 Noise를 제거하지 않고 무작정 Algorithmic approach에만 기대봐야 원하는 값을 찾을 수도 없고, 우연히 찾아낸다고 해도 우연일 뿐이지, 계속 반복적으로 쓸 수는 없는, 즉 학문적 가치, 아니 지식의 가치가 없다고 판단하기 때문이다.

실제로 경제학자들이 모인 연구소에 ML, DL, RL 같은 Algorithm approach를 IT학원처럼 코드만 주워담는게 아니라, 제대로 수학적으로 빌드해서 강의하면,

에이~ 그걸 어떻게 써~

라는 말이 먼저 나온다. Noise 데이터에서 Noise를 제거하지 않고 Pattern을 찾을 수 있다는 "Algorithm approach"가 "사기"라는걸 바로 인지했기 때문에 즉, 사회과학 데이터에는 "틀린" 접근이라는걸 바로 인지하기 때문이다. Noise가 없고, 인과 관계 및 데이터 구조에 모델 구조적인 문제가 없는 데이터, 그런 고민 자체가 필요없는 데이터, 즉 이미지 인식, 자연어 처리 등등, Algorithm approach가 맞는 데이터에만 써야하는데, 그걸 모든 영역에 다 쓸 수 있다고 주장하는 국내 몇몇 공학도들의 우물 안 개구리 같은 모습을 경제학자들이 어떤 눈으로 볼지 충분히 이해되지 않나?

나가며 - ML방법론이 하늘에서 떨어진 방법론이 아니다

이 정도면 링크 건 Summary paper의 약 1/3 정도를 다룬 것 같다. 위의 설명이 어느 정도 길잡이가 됐을테니, 이해하는 독자 분들은 나머지 부분도 링크의 논문을 직접 읽고 이해할 수 있을 것이다. SIAI의 학부 고학년 수준 과목인 Machine Learning, Deep Learning, Reinforcement Learning 등의 수업 일부에서 위의 Summary paper를 다룬다. 그 수업 전에 배우는 다른 통계학 수업에서 배우는 방법론과 위의 설명처럼 하나하나 비교하며, 언제 어떤 경우에만 ML방법론을 쓸 수 있는지를 최대한 직관적으로 이해시키는 것이 그 계산과학 수업들의 목표다. (잘못 배워 나가면 Decision Tree 변형한 모델로 주가 예측하는 변수를 자동으로 찾아낼 수 있다고 망상하는 로보 어드바이저 회사 차리고 투자 받으러 돌아다닐 수도 있다.)

그런 이해도를 갖추게되면, 시장에서 ML, DL, RL을 적용해서 뭔가 엄청난 걸 해 냈다고 주장하는 언론 홍보의 실상을 좀 더 깊이있게, 냉혹한 시선으로 파악할 수 있을 것이다. 아마 학위 과정이 끝나는 무렵이 되면, MBA건 MSc 과정이건 상관없이, 위의 이해도 없이 코드만 갖다 붙여서 만든 결과물이 왜 제대로 작동하지 않는지, 그런 결과물에 시간과 인력과 돈을 쏟아붓는 작업이 얼마나 사회적 자원의 낭비인지, 그래서 제대로 된 지식을 볼 수 있는 시야를 갖춘다는 것이 단순히 연구 작업 뿐만 아니라 기업의 의사 결정과 생존, 발전에 얼마나 결정적인 영향을 미치는지 좀 더 열린 시야로 이해할 수 있게 될 것이다.

하늘 아래 새로운 것은 없다

라는 표현이 있다. ML방법론들, 좀 더 일반화해서 Algorithmic approach라는 것들이, 모델을 기반으로 하지 않고 모델이라고 판단되는 기본 식을 정리해보겠다는, 접근 관점의 차이만 있을 뿐, 사실 방법론들은 모두 기존의 통계학을 활용하는 계산법들에 불과하다. 즉, 기존의 통계학 계산법들이 못하는 걸 해내는 마법도 아니고, 기존의 방법론들이 가진 한계를 벗어나지도 못한다. 그냥 좀 "다른 관점"일 뿐이다.

단지, (특정한 몇몇 경우에만) 모델을 기반으로 하지 않고도 모델을 찾아내는 장점을 가진 반면, 분산, 검정력 등등의 수많은 통계학 도구들을 포기하는 계산법에 불과하다.

그런 한계를 명확하게 이해하고, Listen to Data를 하기 위해 현재 내가 가진 Data의 상황, 내 작업 목적 등등을 두루두루 감안해서 적절한 계산법을 선택하는 것이 진짜 Data Science아닐까?

믿을 수 없겠지만, 저 Summary Paper는 내가 석사했던 학교의 학부 2학년 Introduction to Econometrics 라는 수업의 읽기자료 및 Problem Set이다. 석사 레벨도 아니고, 학부 졸업반도 아니고, 학부 2학년 때 이미 ML, DL, RL이라고 불리는 계산과학적 접근법을 기초 계량경제학 수업 때 (Side로) 듣고(도) 이해하고, Problem Set을 풀 수 있는 수준의 교육을 받는다.

석사 공부하던 시절 내내 그들의 교육 수준에 충격 먹었었지만, 저 논문을 학부 2학년 수업 읽기자료와 연습문제에서 보고 말로 형용할 수 없는 충격을 받았었다. 우리나라 공대에서 자칭 AI한다는 교수들 중에 저 논문으로 만든 고급 연습문제 풀이는 커녕, 논문 자체를 이해하는 비율이 한 자리 숫자가 안 될텐데...

지극히 개인적인 견해를 덧붙이면, 경제학계에서 ML, DL, RL 으로 대표되는 Algorithm approach를 안 쓴 가장 결정적인 이유 (My version of 다섯째)는, 계량경제학자들이 통계학 훈련이 잘 되어 있어서 (최소한 공대보다는 잘 되어 있어서), 통계학 훈련을 하나도 안 받고 무조건 컴퓨터 신(神)님이 모든 문제를 해결해주실 것이라고 중세 신앙적 믿음을 갖는 공학도들보다, 인간의 지성을 더 중요시했던 르네상스 시대에 조금은 더 가까운 공부를 했기 때문이 아닐까 싶다.

전략 컨설팅의 실패와 머신러닝의 관계 (2)

Picture

Member for

8 months 2 weeks

Real name

Keith Lee

Bio

Professor of AI/Data Science @SIAI
Senior Research Fellow @GIAI Council
Head of GIAI Asia

Input

2020-02-24 00:00

지난 글 이후로 많은 의견을 받았는데, 답변차원에서 2번째 글타래를 이어가본다.

지난 글에서 이미 컨설턴트의 '케이스 풀이법'에서 선형적 비지니스 접근의 한계에 대해서는 언급했으므로, 이번에는 실제 현업에서 비지니스 하는 사람들과 컨설턴트들의 차이를 살펴보자.

케이스 풀이법에서 슈퍼마켓 예시를 들었으니 같은 산업에서 스토리를 이어가보려 한다.

컨설팅 vs. 슈퍼마켓 지점장 사례

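Suppose you are the manager of a large supermarket branch. The head of strategic planning at headquarters, a former consultant, has sent you your branch's revenue target written as 'neighborhood population x market share x basket size per person x 52 weeks'. How can you now raise revenue?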

여기서 동네인구를 끌어올릴 수 있는 방법은 없다. 52주를 104주로 늘릴 방법도 없다. 1인당 장바구니 사이즈를 늘릴 수 있는 방법은 일시적으로 인기 상품을 들여와 가능할 수는 있으나, 장기적으로는 불가능에 가깝다. 결국 사람들이 집에서 요리를 해먹도록 만들어야 1인당 장바구니 사이즈가 늘어날텐데, 이는 본사에서도 하기 쉽지 않은, 트렌드의 변화를 불러일으켜야하는 일이다. 문화 현상에 대한 도전은 이 글에서 논외로 하기로 한다.

시장 점유율 끌어올리기

결국 지점장인 당신에게 필요한 내용은 우리동네에서 우리 슈퍼마켓이 경쟁마트보다 조금이라도 더 많은 고객을 끌어들이도록 해야한다는 것이다. 여기서 당신은 아래와 같은 세 가지 정도의 마케팅 전략을 취할 수 있을 것이다.

미끼 상품 몇 개로 사람들 유혹하기
상품 팜플렛 돌리기
입소문 내기

미끼 상품은 뭘로 고르면 될까?

컨설팅의 하향식 접근방식에 따라, 우리 동네에는 4인가구가 많고, 요리를 많이하고, 부자동네라서 고급 음식을 많이 먹으므로 1등급 한우의 비싼 부위를 싸게 판다고 결론 내리는 방식과, 지난 주에 제일 많이 팔린 상품 리스트 중 아직 재고가 많이 남은 상품 1-2개를 골라 재고를 처리하는 방식 중 당신은 어느 쪽을 더 선호하겠는가?

컨설팅의 접근 방식을 취하려면, 1등급 한우의 비싼 부위를 얼마의 가격에 팔아야할지, 몇 %의 손실을 감수해야할지, 홍보효과가 충분해서 사람들이 많이 올지, 방문해서 다른 상품도 많이 사갈지에 대한 구체적인 수치가 있어야하고, 해당 수치를 뒷받침하기 위한 충분한 근거가 필요하다. 예컨대 우리 마트 또는 옆 마트에서 비슷한 미끼 상품을 썼던 기록 데이터가 필요하다.

만약 데이터가 없는 상황이라면 직접 시행착오를 겪으며 숫자가 나오는 것을 봐야한다. 그러나 이러한 시행착오를 겪으며 매몰비용을 지불하는 것보다, 재고 남은 상품을 처리하는 것이 더 합리적일 수 있다. 최소한 이를 통해 할인판매에 얼마나 많은 사람이 반응하는지 알 수 있는 소중한 근거자료를 얻을 수 있기 때문이다. 이는 다음에 다른 미끼 상품을 기획할 때 사용할 수도 있을 것이다.

자, 다음 포인트를 보자. 팜플렛은 어떻게 만들어야되는가? 입소문은 어떻게 내야할까?

미끼 상품 케이스와 똑같다. 플랜을 만들어내려면 오랜세월 마트를 운영해온 경험이 있거나, 아니면 남들이 하는 것을 벤치마크로 삼아 내실이 들어찰 때까지 왜 그렇게 했을까 고민하면서 모방하는 수 밖에 없다.

싸게 파는데도 사람들이 안 온다

조금 더 깊게 들어가서, 아무리 싸게 상품을 내놔도 사람들이 우리 마트에 안 온다고 해보자. 마트 직원들이 고객들에게 설문한 결과 주차장이 불편해서 장보는 아주머니들이 낮에 차를 끌고 오지 못해서 불만의 목소리를 높이고 있다고 한다. 나아가 마트 차원에서 더 많은 주차가 가능하도록 하려고 차 간격을 좁게 했더니 문콕이 자주 발생하고 있다고 한다.

이를 컨설팅 방식의 하향식 접근법을 적용하면, 문 앞에서 고객들에게 설문지를 돌리면서 정보를 수집할 것이다. 그러나 고객이 솔직하게 답변을 하지 않을 가능성도 있으며, 문항 구성에 따라 답변에 대한 신뢰도가 천차만별로 달라진다.

주차장 문제를 해결했다고 하더라도, 최초 설계부터 잘못되었기 때문에 건축설계사무소를 고소하고, 설계 변경 비용 내고, 지방정부기관 건축팀 승인 및 안전 평가, 건설업체 결정 문제 등 매우 많은 시간과 비용이 투입되겠지만, 양보해서 해당 문제를 빠르게 해결해서 고객들이 마트에 많이 오게 되었다고 가정해보자.

오게 만들어 놔도 돈을 안 쓴다

고객들이 매장에 들어왔는데, 다른 마트보다 장바구니 금액이 적다.

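The consultant will propose a 'solution': what income and consumption levels the people in our neighborhood have, which kinds of products they therefore consume, what brand image some of those products carry, and hence which kinds of products the store should stock more of.

Meanwhile, a seasoned branch manager with plenty of field experience will first check how customers move through the store. In particular, a manager who has already solved the parking problem will personally carry a basket, or bring the family along, and do a week's worth of shopping. This reveals cases where the products customers want are missing, where wanted products are hard to spot, or where the layout is poorly explained so that customers get confused or give up.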

분석이 끝났으면 고객이 움직이는 동선을 의도적으로 조절할 수 있도록 매대의 폭을 조절하거나, 상품이 잘 팔리는 코너에 더 주력 상품을 배치하거나, 리베이트를 많이 주는 상품들 위주로 홍보 마크를 달아놓는 방향으로 매장 안에서 고객들이 장바구니를 더 풍성하게 채울 수 있도록 하는 환경을 조성할 것이다.

이런 지식을 공부하는 산업공학의 분과학문을 소매 인체공학(Ergonomics in Retail)이라고 하고, 이런 종류의 수업과 교재도 찾아볼 수 있다. 우리가 마트에서 보는 상품 배치는 관련 주제의 학문을 하는 연구자들이 기하학적 지식을 동원해 무수히 많은 시행착오로 만들어낸 결과물인 것이다.

현장 경험이 많은 노련한 업계 종사자들은 이런 것을 경험으로 익히고 있는 것이고, 박사들은 기하학이라는 수학 지식과 실험 기반의 통계적 연구로 그런 지식을 쌓은 차이가 있을 뿐이다.

머신러닝이 쓰일 곳

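What revolutionized the older research practice of collecting data on how people actually move through a building and drawing inferences from it was, paradoxically, not the installation of radio devices (beacons) that track movement well, but the shift of the center of purchasing to online channels, where tracking itself is trivially easy. As in the picture above, only about ten years ago the retail industry was still collecting data by having mock customers wear devices and shop in person, in order to work out how products should sensibly be placed.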

온라인 쇼핑몰에서는 고객이 어떤 검색어로 어떤 상품을 봤고, 얼마나 긴 시간동안 그 페이지를 보다가 다른 페이지로 이동했는지에 대한 '지문'을 전부 보유하고 있다. 그 상품 하나만 보고 안 사고 나가버린 데이터 밖에 없으면 문제의 원인이 뭔지 알아내기 힘들 수 있으나 다른 상품을 결국 구매하는걸 보고, 두 상품 간의 차이에 대한 정보를 알아낼 수만 있으면 앞 상품이 왜 안 팔렸는지 쉽게 이해할 수 있게 된다. 고객들의 상품들을 구매하는 데이터가 쌓이기 시작하면, 상품의 가격이 문제였는지, 또는 어떤 특징이 문제였는지를 구분하는 정확도가 가파르게 올라가게 된다.

나아가 같은 카테고리의 상품 N개를 묶고 그 중 가격과 판매량 간의 상관관계가 얼마나 높은지 잡아내면, 해당 상품군은 상품 품질이 중요한지, 또는 가격이 중요한지에 대한 인사이트를 얻을 수 있을 것이다. 이는 산업조직론이라는 경제학의 분과 학문에서 오랫동안 해 오던 작업이고, 온라인 쇼핑몰에서는 시스템만 갖춰져있으면 불과 클릭 몇 번에 같은 정보와 결론을 얻을 수 있다.
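
A minimal sketch of the category-level check described here. The two categories, their prices, and their sales figures are invented for illustration, and `pearson` is a plain-Python helper. A strongly negative correlation suggests a price-driven category, while a correlation near zero suggests that something other than price drives sales.

```python
import math


def pearson(xs, ys):
    # Pearson correlation coefficient between two equally long sequences.
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Hypothetical categories: (prices in KRW, units sold) for N products each.
categories = {
    "bottled_water": ([900, 1000, 1100, 1200, 1300], [520, 430, 350, 260, 180]),
    "premium_coffee": ([9000, 10000, 11000, 12000, 13000], [40, 55, 38, 52, 45]),
}
for name, (prices, units) in categories.items():
    print(f"{name}: corr(price, units) = {pearson(prices, units):+.2f}")
```

A correlation on observed prices is only a first look. It says nothing about causality if prices were set in response to demand.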

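Taken a step further, the decisions a business operator has to agonize over, such as a single line of product copy, the colors used in a thumbnail, the model in the thumbnail, the font, and the screen colors, can be settled with reasonably well-founded numbers. These topics are now being taught to students in retail ergonomics within industrial engineering departments.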

즉 유저들의 행동을 뒤에서 보고, 설문지를 나눠주고, 질문을 하고 답을 얻어가면서 비싸게 얻었던 유저 행동 데이터를 온라인 마케팅 서비스에서 실시간으로 얻어낼 수 있다는 것이다. 물론 서비스가 제대로 작동되어야 유의미한 데이터를 얻을 수 있고, 그 전까지는 정보가 없는 상태에서 시작해야한다는 단점이 있다. 이 때는 온라인 마케팅 또한 시행착오를 겪으며 서비스를 구축해야한다.

필자는 짧지 않은 기간동안 학문과 비지니스를 하면서 한 분야의 지식이 성숙되어 가는 과정에 큰 공통점이 있다는 것을 발견했다. 지식은 어떤 단계에 있든 상관없이 작은 차이를 읽고 이를 어떻게 추상화하여 다른 분야에 적용할 수 있을까, 혹은 이 문제를 어떻게 해결할 수 있을까와 같은 문제의식에서 출발한다는 것이다.

두 영역간의 차이는 그 문제를 해결하는데 수학, 통계학 같은 학문을 쓰느냐, 직관과 경험을 쓰느냐의 차이일 뿐이다.

비즈니스 의사결정 구조의 진화 by 데이터 사이언스

설문지보다 훨씬 더 정확하게 인간의 선호를 보여주는 데이터, 현시선호 (Revealed preference)가 표시된 데이터를 손쉽게 얻을 수 있는 세상이 되었기 때문에 컨설턴트의 하향식 접근방식의 가치가 급격하게 떨어지는 시대가 도래했다. 오히려 통계학적 방법론을 이해하고, 더 복잡한 비선형 패턴을 찾아내는 머신러닝 지식을 가지고 있는 인재들에 대한 수요가 늘어날 수 밖에 없게 되었다.

위의 그림을 보면,

A타입: Top-down 형태의 컨설팅 형태의 지식 생성 과정
B타입: Bottom-up 형태의 하나하나 벽돌을 쌓아올려서 삼각 피라미드를 완성하는 과정
C타입: 삼각형 2개를 겹치는 복잡 구조물을 만드는데 하나하나 데이터의 검증을 받는 과정

이 있다. 지금까지 비지니스가 B타입이었기 때문에 대부분이 A 방법론으로 접근해도 될 것이라고 생각해왔다. 그러나 최근들어 C타입처럼 복잡하고 특화된 업무가 생겨나면서 내가 맞는지에 대한 시행착오가 비약적으로 늘어나게 됐고, 시행착오를 비싼 비용을 들여 가짜 유저를 투입시키지 않고도 데이터를 이용해서 높은 정확도를 기대할 수 있게 되었다. 업무는 C타입으로 복잡해졌고, 검증은 온라인 유저 데이터를 이용하면 빠르게 진행되므로 B타입 시절에 A 방법론을 쓰던 컨설턴트가 가지는 엣지가 급격하게 감소하게 된 것이다.

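One-off tests on one or two variables, such as the t-test and the A/B test, are the commonly known forms of statistical testing, and sequential testing that solves the bandit problem with Bayesian methods can be seen as a use of statistics in the same class.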
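
As a sketch of what a sequential Bayesian test for the bandit problem can look like, the following uses Thompson sampling on a two-armed Bernoulli bandit. The conversion rates, number of rounds, and function name are hypothetical, and Thompson sampling is one common choice rather than the specific method the post has in mind.

```python
import random


def thompson_sampling(true_rates, rounds, seed=0):
    """Bernoulli bandit with Beta(1, 1) priors; returns pulls per arm."""
    rng = random.Random(seed)
    wins = [0] * len(true_rates)
    losses = [0] * len(true_rates)
    for _ in range(rounds):
        # Draw one plausible conversion rate per arm from its posterior ...
        draws = [rng.betavariate(1 + w, 1 + l) for w, l in zip(wins, losses)]
        arm = draws.index(max(draws))  # ... and play the arm that looks best.
        if rng.random() < true_rates[arm]:
            wins[arm] += 1
        else:
            losses[arm] += 1
    return [w + l for w, l in zip(wins, losses)]


# Hypothetical conversion rates of two page variants, unknown to the algorithm.
pulls = thompson_sampling(true_rates=[0.2, 0.6], rounds=2000)
print("times each variant was shown:", pulls)
```

Unlike a one-off A/B test with a fixed split, the allocation shifts toward the better variant while the data are still coming in.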

지난 글에서 선형적인 결과만을 내놓는 업무를 하는 직장에 있으면 2000년대 초반의 지식 체계에 머무를 것이라고 언급한 바 있었다. 요즘 데이터 사이언스 붐이 생기는 이유, 이런 붐이 지속되기 위한 필요조건은 C 타입의 업무가 얼마나 많아지고 사람들에게 받아들여지느냐에 달려있다. 단순한 이미지 인식이나 자연어 처리는 연구실에서 고민한 함수를 적용하는데 그치지만, C 타입의 업무는 본인의 데이터 사이언스 내공을 실무에 적용하는 비지니스 감각에도 영향을 받을 것이다.

나가며 - 전략 컨설턴트 접근법의 몰락

컨설팅 스타일의 Top-down 접근법 자체가 틀린 것은 아니다. 그러나 그 접근법은 지식을 체계적으로 쌓아 올리는 과정이 아니라, 정확도를 희생하고 속도를 올리려는 접근법이다. 이는 현실상황이 복잡해질 수록 그 한계가 명확하게 두드러질 수 밖에 없다.

우리 시대에 더 이상 컨설팅적 접근법이 효과적이지 않은 가장 큰 이유는 이제는 더 이상 해당 접근법이 적용될 수 없을 만큼 사회가 다원화, 복잡화되었기 때문일 것이다. 좁은 구역 주차를 수천번해보던 전문가가 아니라면, 주차장 모양만 보고 바로 문제점을 지적할 수 있는 소매 인체공학 전문가가 아니라면, 자기 마트 주차장의 문제는 직접 주차 실패를 겪으며 몇번 차를 긁어봐야 알 수 있을 것 같다.

전략 컨설팅의 실패와 머신러닝의 관계

Picture

Member for

8 months 2 weeks

Real name

Keith Lee

Bio

Professor of AI/Data Science @SIAI
Senior Research Fellow @GIAI Council
Head of GIAI Asia

Input

2020-02-10 00:00

우리 회사에 전략 컨설팅 방식의 비즈니스 접근 방식을 좋아하고, 그 방법으로 비지니스 의사결정을 안 하고 있는 상황을 잘못되었다고 지적하는 직원이 하나 있다.

그 분의 접근 방식이 왜 잘못되었는지를 설명하다보니, 해당 설명이 왜 선형 회귀에서 비선형 회귀 또는 머신러닝으로 계산 알고리즘의 중심축이 이동하고 있는지와 맞닿아있는 것 같아 글을 한번 정리해본다.

전략 컨설팅에서 하는 '케이스 풀이법'

우선 전략 컨설팅에서 하는 '케이스 풀이법'을 한번 살펴보자. 우리 동네 대형 슈퍼마켓의 매출액을 가늠한다고 할 때, 1가구 4인으로 가정하고, 각 가구별로 1주일에 한번 장을 보러 간다고 가정하고, 한번 장을 보러갈 때 고기, 야채, 우유 등등 기본 품목과 가끔씩 사는 품목을 생각하면 약 20만원의 장바구니가 나온다고 가정해보자. 우리 동네 인구는 120만명이고, 대형 슈퍼마켓이 3개 있고, 각각의 시장 점유율은 33%씩이라고 하면, '30만 가구 X 시장점유율 33% X 장바구니 20만원 X 52주' 라는 방정식을 통해 그 대형 슈퍼마켓의 매출액을 가늠할 수 있다.

조금 더 세부적으로는 인구, 가구별 인원 수, 시장점유율, 장바구니 사이즈 등등에 대해 더 많은 자료를 붙여서 좀 더 정확한 예측을 해 볼 수 있을 것이다. 해당 방식에 대해서는 시대 상황이 많이 달라졌기 때문에 방법론이 달라졌을 수는 있겠으나 컨설팅 업계에서 논리를 구조화하는 방법은 여전히 동일하다고 알려져있다. 이를 가정하고 전략 컨설팅에서 하는 '케이스 풀이법'에 지적을 하자면, 해당 접근법은 매우 많은 영역에서 틀린 이야기를 담고 있다.

모든 것은 가정에 불과

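First, the equation above rests on numerous assumptions, none of which has been verified. Apart from actual figures such as the neighborhood population of 1.2 million and the three large supermarkets, not a single number in it can be used with confidence. Built on such incomplete assumptions, the equation runs into questions like the following.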

1가구는 4인으로 구성되는가?
1인당 소비하는 장바구니는 (최소한 합계금액이라도) 비슷한가?
1주일에 한번 방문하는 것이 맞는가? 다른 주기로 방문한다면 1주일 평균 장바구니 금액은 비슷한가?
대형 슈퍼마켓 대신 편의점이나 집 앞 소형 마트에서 구매하는 비중은 얼마인가?
시장 점유율은 항상 고정인가? 수시 판촉 행사로 매일처럼 널뛰기 하는건 아닌가?
설, 추석, 연말 등의 시즌에도 매출액 비중은 비슷한가?

위 방정식은 52주를 곱하고 있기 때문에, 제기된 질문에 조금만 다른 답을 하더라도 이는 전체 매출액 예측치에 매우 큰 영향을 주게 된다.
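
The case-solving equation and its sensitivity can be written out directly. The base figures are the ones in the example (1.2 million residents, 4 per household, 33% share, a 200,000-won basket, 52 weeks). The alternative scenarios are hypothetical values chosen only to show how far the total moves.

```python
def annual_revenue(population, household_size, market_share, basket_krw, weeks=52):
    # households x share of households shopping here x weekly basket x weeks per year
    households = population / household_size
    return households * market_share * basket_krw * weeks


base = annual_revenue(1_200_000, 4, 0.33, 200_000)

# Hypothetical alternative answers to the questions above (illustrative values only).
scenarios = {
    "base case": base,
    "2.5 people per household": annual_revenue(1_200_000, 2.5, 0.33, 200_000),
    "share drops to 30%": annual_revenue(1_200_000, 4, 0.30, 200_000),
    "basket is 180,000 won": annual_revenue(1_200_000, 4, 0.33, 180_000),
}
for name, value in scenarios.items():
    print(f"{name:>26}: {value / 1e8:>8,.0f} x 100M KRW ({value / base - 1:+.0%} vs base)")
```

The base case comes to roughly 1.03 trillion won a year, and changing only the household-size assumption from 4 to 2.5 moves the estimate by 60%.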

Are the relationships between the variables linear, like multiplication or addition?

앞서 30만 가구 중 33%가 우리 슈퍼마켓을 방문한다고 가정했다. 그렇다면 우리 동네 30만 가구의 소득 수준은 비슷할까? 소득 수준이 비슷하다면, 혹은 다르더라도 소비 수준은 비슷할까? 소비 수준이 비슷하다고 가정하더라도, 소비 수준이 평균과 분산만 쳐다보면 되는 단순한 정규분포 구조를 따르고 있을까?

경제학 교과서에 나와있는 소득 구조에 대한 그래프들은 최상위 층이 대부분의 소득을 독점하고 있다고 입을 모아 말하고 있다. 또한 각 국가별 분배구조가 어떻게 이루어지고 있는지를 밝히는 수 많은 지표들이 존재한다. 일각에서는 지니계수를 반영하여 해당 방정식을 계산하면 된다고 지적하지만, 소득구조는 대부분 포아송 분포 또는 로그 정규분포 형태를 갖기 때문에 지니계수를 반영한다고 해서 계산이 정확해지기는 어렵다. 또한 정규분포, t분포와 같은 좌우대칭형 종 모양 분포가 아닌 분포들은 3차 모먼트(moment) 이상의 정보를 반영해야 정확한 계산이 가능해진다. 예컨대 자료의 대부분이 왼쪽으로 치우친 분포에서는 평균과 최빈값의 차이가 발생한다. 이 상황에서 단순 평균을 계산한다면 과다계상된 매출액이 나올 것이다.
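
A quick simulation of the skewness point, assuming household spending follows a log-normal distribution. The parameters (a median of 20 in units of 10,000 won and $\sigma = 0.8$) are hypothetical. In such a distribution the mean sits well above the median, so the "average" household is not the typical one.

```python
import math
import random
import statistics

random.seed(42)
# Hypothetical weekly household spending (in 10,000 KRW): log-normal with median 20.
mu, sigma = math.log(20), 0.8
spend = [random.lognormvariate(mu, sigma) for _ in range(10_000)]

mean = statistics.fmean(spend)
median = statistics.median(spend)
share_below_mean = sum(s < mean for s in spend) / len(spend)

print(f"mean   : {mean:.1f}")
print(f"median : {median:.1f}")
print(f"households spending less than the mean: {share_below_mean:.0%}")
```

Any calculation that treats the mean as the spending of a representative household therefore overstates what most households actually spend.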

시장이 고정적일 것이라는 가정 또한 비판의 여지가 존재한다. 동네 마트 주변에 수시로 다른 업체가 진입하고 있거나, 주민들이 다른 동네 마트를 쉽게 다녀올 수 있는 구조거나, 이커머스 회사들의 할인 쿠폰 마케팅 한번에 매출액이 엄청나게 크게 움직일만큼 시장 상황이 다변수에 영향을 받고 있다면 해당 방정식을 통한 예측 정확도는 현저히 낮아질 것이다.

선형 방정식으로는 더 이상 불가능한 도전

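As noted above, the relationships between the variables in the equation should not be expressed through simple multiplication or addition. In fields that model with mathematics, a causal relationship built from simple multiplication and addition is called linear. Cases that require checking third and fourth moments beyond the mean and variance, looking at the combined influence of several variables including higher-order terms such as squares and cubes, or using forms such as exponential and logarithmic functions are called non-linear.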

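Returning to the example, a 100,000-won basket for a two-person household does not imply a 200,000-won basket for a four-person household. As a household grows, cooking at home becomes cheaper, while a two-person household of two working spouses may eat out often. Furthermore, for six-person households, depending on apartment size and on whether they live in a city or the countryside, the relationship in basket spending seen between two- and four-person households will not carry over unchanged from four to six.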

아래의 그래프를 하나 보자.

직선이 A
거의 진폭이 없는 그래프가 B
가장 아래 위 진폭이 심한 그래프가 C

이다.

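If the linear function A is used to match the heavily oscillating function C, then in the graph above the two coincide only six times, at the horizontal-axis values (0, 1, 2, 3, 4, 5). In other words, when the linear 'case-solving' of strategy consulting is used to predict a complex real world with non-linear relationships between variables, a high level of accuracy is hard to expect.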

그동안 인류가 선형 관계식으로 대부분의 문제를 풀 수 있었던 이유는 많은 문제들이 B 정도의 현실복잡도를 갖고 있었기 때문이거나, 우리가 목표하는 단위가 분기, 년 등으로 시점이 정해져 있어 특정 포인트(위 그래프에서는 1,2,3,4,5에 해당한다)들만 예측하면 되었기 때문이다.

시간을 더 들여서 예측 정확도를 높일 수 있다?

일각에서는 제대로 시간과 노력을 들여 모델(방정식)을 만든다면 C 그래프를 만들어 낼 수 있다고 말한다. 그러나, 제 3자의 개입이 비주기적으로 나타나면서 시장을 바꾸는 상황을 위 단순 예측 모형으로 맞추려고 하는 것 자체가 근본적으로 잘못된 접근에 해당한다. 또한 C 그래프 또한 평면 위에 있기 때문에 1변수 모델의 한계를 뛰어넘지 못하고 있다. 3, 4, 5차원 공간으로 그려야할 다양한 변수가 영향을 미치고, 각 변수들이 본인이 통제할 수 없는 상황에서 위와 같은 단순 방정식이 좋은 퍼포먼스를 보여주기는 어렵다.

위와 같은 상황의 경우 시간을 더 들여서 A 그래프를 C 형태 그래프로 맞추는 것이 아니라, 마르코프 결정 과정(Markov Decision Process)의 Action, Strategy, Outcome 조합을 두고 시뮬레이션을 하는 Q-learning과 같은 절차로 복잡한 변수간 비선형 관계를 갖는 현실 세계를 모델링하는 작업이 필요하다. 다시 말해서, 시간을 더 들여서 고민해보는 문제가 아니라, 완전히 다른 레벨의 지식이 필요하다는 것이다. 이를 더 구체적인 예시로 확인해보자.
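
To make the reference to Q-learning concrete, here is a minimal tabular Q-learning sketch on a toy Markov decision process. The environment (a four-state chain with a reward at the right end), the learning rate, and the discount factor are all assumptions for illustration. A real market model would need a far richer state and action space.

```python
import random


def q_learning(n_states=4, episodes=500, alpha=0.5, gamma=0.9, seed=0):
    """Tabular Q-learning on a toy chain: states 0..n_states-1, actions 0=left, 1=right.
    Reaching the last state pays reward 1 and ends the episode."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(n_states)]
    goal = n_states - 1
    for _ in range(episodes):
        state = 0
        while state != goal:
            action = rng.randrange(2)  # purely random exploration (off-policy)
            nxt = max(state - 1, 0) if action == 0 else state + 1
            reward = 1.0 if nxt == goal else 0.0
            # Bellman target: immediate reward plus discounted best future value.
            target = reward + (0.0 if nxt == goal else gamma * max(q[nxt]))
            q[state][action] += alpha * (target - q[state][action])
            state = nxt
    return q


q_table = q_learning()
policy = [row.index(max(row)) for row in q_table[:-1]]
print("greedy action per non-terminal state (1 = right):", policy)
```

The policy is not written down in advance as an equation. It is learned from simulated action and outcome pairs, which is the difference in kind from spending more time tuning a linear formula.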

중앙은행이 통화정책을 결정하는 보고서를 만드는 프로세스

방법1. 경제학자가 풀어내는 방식

당신이 거시경제학에서 초저금리 중에 팽창 통화 정책은 실물 경제에 직접 영향을 미치지는 못하지만 초단기채권 금융시장에 교란을 줘서 투자자들이 장기채에 투자하게 되도록 만드는, 결국 장기채 금리가 내려가서 기업들의 장기채 발행을 유발할 수 있다는 종류의 논문을 쓰고 경제학 박사 학위를 받은 다음, 중앙은행의 금융통화정책 결정팀 핵심 연구위원으로 취직했다고 가정해보자.

금융통화위원회 위원이 당신에게 이번에 금리를 올리자고 이야기를 하고 싶은데, 어떤 효과가 있는지 보고서를 하나 만들어 달라고 했다고 해 보자. 이를 위해 당신은 우선 금리 올리려면 중앙은행이 취하는 정책 수단이 뭐가 있는지 살펴볼 것이다. 예를 들면 통화안정채권을 시장에 팔 수도 있고, 시중은행들에게 이연평잔을 계산하는데 압박을 줄 수도 있다.

그 중 현재 시장 상황, 특히 지난 몇 달간의 금융 시장을 보고 합리적이라고 판단되는 몇 가지 정책 수단을 골라 어떤 규모로 시장 개입을 하면 0.25%, 0.5% 금리가 오를 것이며, 거기에 맞춰 금융시장이 얼마나 위축되고, 따라서 실물 경기가 얼마나 위축될지를 계산하기 위해 거시경제모형에 수치를 입력해볼 것이다. 그 모델은 몇 백개의 변수가 뒤얽혀 있을텐데, 박사 시절에 공부할 때는 거의 대부분의 변수를 고정시켜놓고 내 연구 주제에 직접 영향을 주는 인과 변수들 몇 개만 넣은 소형 모델을 봤었는데, 현실 상황의 모델이 매우 복잡해져서 기존 연구와는 다른 새로운 도전을 하게 될 것이다.

또한 매우 높은 확률로 결과가 예상과는 다르게 출력되어 이를 보정해주어야하는데, 이를 위해 박사 시절 봤었던 논문의 시장의 특정 상황을 고려하는 변수를 추가를 해보면서 결과를 보정하는 작업을 계속 거치게 될 것이다. 심지어는 본인이 연구할 때 확인했던것처럼 장기간 초저금리가 지속되고 있기 때문에 그 논문과 본인 논문을 결합하는 새로운 모델을 만들고 테스트를 해 봐야할 수도 있을 것이다.

방법2. 컨설턴트가 풀어내는 방식

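The consultant will look at the record of rate hikes and cuts over the past ten years, note that there were 15 changes of 0.25% or 0.5%, and build graphs of how economic growth and the growth rate of each industry changed each time, citing the central bank's quarterly industry reports. Next comes a graph such as IS-LM, which is no longer much used in academic economics: the LM curve shifts left and the economy contracts, and the graph is adjusted so that this movement roughly matches the results obtained from the graphs above.

Alternatively, to reflect in the engagement the fact that ultra-low rates change how economic policy is transmitted, the consultant will invite a journalist with 15 years at the Wall Street Journal as an expert adviser. Suppose this journalist, who has no deep background in economic policy, mentions that because of ultra-low rates the Federal Reserve (Fed) is running a project with a large group of invited economists from Harvard, MIT, U Chicago, and Princeton. The consultant will then prepare a PPT that is simple but looks very impressive on the surface: how raising the interest rate affects each industry and therefore each industry's growth rate, how industry shares are distributed in our country, and a forecast of national economic growth obtained by multiplying the two.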

왜 전략 컨설팅이 사양산업이고, 머신러닝이 떴을까?

저 위의 전략 컨설팅 스타일 발표 자료와 논리 전개 방식은 높은 수준의 전문 지식을 필요로 하지 않는다. 자기 산업 분야에 특화되어 있는 몇몇 업계별 경력직 컨설턴트가 아니면, 대부분이 업계 지식 및 논리를 통해 수백개의 가정을 이용해 본인들의 논리를 만들어내고 있다.

하지만 위의 논리는 대부분은 선형 관계의 논리에 국한되어있다. 학문적인 깊이, 사고의 깊이, 지식의 깊이를 통해 남들이 보지 못하는 지점을 바라봐야하는데, 그러한 논리들은 다른 차원의 지식을 체화하면서 만들어낼 수 있는 논리, 즉 비선형관계에 기반한 논리들이기 때문이다.

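Machine learning, or data mining, or data science, is a computational-statistics approach for finding multi-patterns and non-linear patterns that cannot easily be found in data. In the past, almost all data distributions converged to the normal distribution, so there was no particular need to look for non-linear patterns. Today, however, computing even a single supermarket's revenue calls for a Markov decision process that accounts for the behavior of diverse market participants, and deriving insights that others cannot see from the linear equation above is close to impossible.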

나가며 - 산업계가 컨설턴트를 외면하는 이유

A를 위한 방정식을 만드느니 경험이 많은 업계 종사자들은 차라리 경험에 따른 자신의 직관을 믿고 사업을 할 것이다. 컨설턴트에게 수억원을 주고 받은 A보다 경험에서 나온 직관이 B에, 어쩌면 C에 가까울 확률이 더 높기 때문이다.

위에서 지적한대로,

무수히 많은 가정에 기반해야하고,
그 가정들이 제대로 검증되지 않은 경우가 빈번하고,
더 나아가서 그들의 결론은 Linear 방정식에 국한되어 있어서

오늘날의 전략 컨설턴트가 가지고 있던 엣지는 점차 사라지고 있다. 데이터는 비선형 방정식을 필수적인 접근 방법론으로 삼을만큼 고도화 되었고, 컨설턴트의 선형적 문제 접근은 우리시대의 비즈니스에 유용한 인사이트를 주기에는 명확한 한계점이 존재한다.
