
Apple and SpaceX both sued: America's fierce crackdown on employment discrimination
By 김서지 (Kim Seo-ji), Reporter

Apple to pay $25 million settlement after avoiding 'PERM' hires of foreign workers
Elon Musk's SpaceX also caught excluding asylees and refugees from its hiring process
Even 'discrimination against Americans' gets caught: Facebook sued for filling positions with temporary foreign workers

Apple has settled employment and hiring discrimination charges brought by the US Department of Justice. On the 9th (local time), the Wall Street Journal (WSJ) reported that Apple agreed to pay $25 million (about 33.01 billion won) to resolve the DOJ's hiring-discrimination charges related to the Permanent Labor Certification program (PERM). The company appears to have quietly backed down before the whip of the DOJ's employment-discrimination crackdown.

Accused of avoiding PERM, the labor certification for foreign workers

PERM is a labor certification program that allows companies to employ foreign workers in the United States permanently. When a foreign employee working in the US meets certain requirements, the employer can sponsor the worker's lawful permanent residency by filing an EB-2 petition (employment-based second preference, for advanced-degree professionals).

Apple was accused of avoiding PERM recruitment, or of failing to provide information about it. In particular, the company was singled out for accepting PERM applications only by mail rather than through the Department of Labor's PERM website, thereby excluding applications filed electronically. The DOJ noted that Apple's ineffective recruitment procedures meant that applicants with valid work authorization filed few or no applications for PERM positions.

According to the WSJ, Apple will pay a civil penalty of $6.75 million (about 8.91 billion won) plus $18.25 million (about 24.1 billion won) into a fund for victims of the discrimination. Under the settlement, Apple will also post notices about PERM positions on its recruitment website and accept applications digitally, enabling broader recruitment for PERM positions. Apple said it had "realized we had unintentionally not been following the DOJ standard," adding that it "agreed to a settlement to resolve the issue and will continue to hire American workers."

SpaceX and Meta also sued for employment discrimination

This is not just Apple's problem. SpaceX, Elon Musk's aerospace manufacturer, was likewise sued by the US Department of Justice last August over employment discrimination. The DOJ found that from September 2018 to May 2022, SpaceX failed to consider or hire refugees and asylees on the basis of their citizenship status, in violation of the Immigration and Nationality Act (INA).

SpaceX is reported to have excluded asylees and refugees throughout its hiring process. The company argued that under US export control law only citizens and permanent residents could join SpaceX, but the DOJ rejected the argument: while SpaceX, which develops advanced space technology, must indeed comply with the International Traffic in Arms Regulations and the Export Administration Regulations, those laws do not require treating asylees and refugees differently from US citizens and permanent residents. The DOJ asked the court to impose fines on SpaceX and to order changes to company policy so that it complies with its non-discrimination obligations going forward.

Conversely, there is also a case of being sued for excluding 'Americans' from hiring. In 2020, the DOJ sued Facebook (now Meta), alleging that it discriminated against American workers by preferentially hiring temporary foreign workers. The complaint charged that, in filling 2,600 positions, the company declined to hire US workers and instead filled the roles with lower-paid foreign workers holding H-1B and other temporary visas.


STA501 Mock exam - F2023
By Keith Lee, Head of GIAI Korea and Professor of AI/Data Science @ SIAI
\begin{document}
	
\begin{center}
	\textbf{\Large STA502: Math \& Stat for MBA I \\ \bigskip Mock Exam F2023}\\
\end{center}

\begin{question}
As a recent graduate of SIAI's renowned MBA in AI/BigData, you just got hired at one of the Fortune 500 companies. As with all rapidly growing tech companies, the compensation package is incredibly generous with handsome stock options, but the company also has a reputation for egocentric seniors at all levels. On the very first day, at a town-hall meeting, you witnessed two founding members of the company, still young and energetic, in a heated discussion over the benefits and costs of running an online community for AI-specific content. \\
	
Boss A claims that the community needs more members to grow fast, and that by the network effect the number will, at some point, grow exponentially, which will eventually earn the company credit as an "AI experts' company". On the other hand, Boss B argues that the community needs more content, especially supreme-quality content, to attract more people. Boss B thus insists that the company hire reputed data scientists across the states. However, since Boss A believes the number of people pre-defines the community's recognition, he suggested inviting entry-level software engineers with little to no knowledge of the scientific aspects of AI/Data Science. He continued that there are millions of people learning an entry language called PyR, so the community should provide basic libraries written in PyR to attract them. Boss B, conversely, claimed that advanced languages like Julian, despite their limited user base, should be the company's focus to earn the 'AI experts' company' badge. \\
	
After watching the impassioned debate, your data science team's boss gave you an assignment: verify which variable is the key to a winning online community, a large user base ($U$) or the amount of quality content ($C$). \\
	
Since you believe that, for a successful community, the user and content bases are each simultaneously necessary and sufficient, you wonder if this analysis has to incorporate the simultaneity model that you learned in Math \& Stat for MBA. Still unsure what model can recover the true nature of a successful and reputed community, you have designed the following two simple relationships:
	\begin{align*}
		C &= \alpha_1 + \alpha_2 U + \alpha_3 M_2 + u \hspace{1cm} (1)\\
		U &= \beta_1 + \beta_2 C + v  \hspace{2.1cm} (2) 
	\end{align*}
where $M_2$ may be assumed to be exogenous to anything related to a successful community and $u$ and $v$ are identically and independently distributed disturbance terms with zero means. The observations for $M_2$ are drawn from a fixed population with finite mean and variance.
	\begin{enumerate}
		\item[1.] Derive the reduced form equation for $C$. (5 marks) \\
		
\item[2.] Demonstrate that the OLS estimator of $\beta_2$ is, in general, inconsistent. How is your conclusion affected in the special case $\alpha_2 = 0$? How is your conclusion affected in the special case $\alpha_2 \beta_2 = 1$? What do these special cases mean in words? (5 marks) \\
		
		\item[3.] Demonstrate that the instrumental variables (IV) estimator of $\beta_2$, using $M_2$ as an instrument for $C$, is consistent. Why do you need an IV estimator? (5 marks)\\
		
		\item[4.] Instead of using IV estimation, the researcher decides to use 2-Stage-Least-Square (2SLS) in the expectation of obtaining a more efficient estimator of $\beta_2$. He fits the reduced form equation for $C$: 
		\begin{align*}
			\hat{C} = k_1 + k_2 M_2 \hspace{2cm} (3)
		\end{align*}
saves the fitted values, and uses them as an instrument for $C$ in equation (2). Demonstrate that the 2SLS estimator is consistent. (5 marks) \\
		
		\item[5.] Determine whether the researcher is correct in believing that the 2SLS estimator is more efficient than the IV estimator. (5 marks) \\
		
		\item[6.] How do you prove that IV (or 2SLS) estimation is superior to OLS? (5 marks)  \\
		
\item[7.] If you have $M_1$ for equation (2), as $M_2$ is for equation (1), can you obtain any better result? If so, in what context? Can you argue that more instruments promise better results? (5 marks) \\
	\end{enumerate}
	
At that point, your data science team's boss asked for your interim report, and you told him that you were looking for pertinent instrumental variable sets to strengthen your argument. Your boss, however, is a strong disbeliever in scientific models and has a firm belief in machines. He asks you to complete the research as soon as possible just by applying all the machine learning models from the PyR library and choosing the best-matching model. \\

	\begin{enumerate}		
\item[8.] Can you extend your logic in 6) to disprove a claim that a machine learning model is superior to OLS? Assume that your model's errors, even after first-stage data pre-processing, jointly follow a Gaussian distribution. What happens if they are non-Gaussian? (5 marks) \\
		
\item[9.] Having been benchpressed by your logic, your boss, with a firm belief in machine learning, claims that adding a quadratic term, instead of IV or 2SLS, is a far superior estimation strategy, because he believes non-linear \& non-parametric estimation by computers is better than humans' faulty logical thinking, as was witnessed by AlphaGo and abundant achievements by "Artificial Intelligence". Provide your rebuttal. (10 marks) \\
	\end{enumerate}
\end{question}


\begin{question}
Since the year 2040, when the government of Sirius announced full scholarships for all domestic AI/Data Science programs in universities, there has been an on-going debate over whether the education actually is rewarding, in terms of quality and the future wage of labor. Not all companies have vigorously hired AI grads, and after 15 years of records, one researcher wants to clarify whether AI adoption through active hiring really helped companies grow faster. \\
	
The following regressions are for 9,125 AI/Data Science graduates in 2050. The data science strategy of the study is to compare, in 2055, various outcomes that helped companies grow (assets, sales, and number of employees) between AI-trained and non-AI-trained hires. The regressions also interact AI grads' status with the labor market's appreciation of their quality, which is reflected in wage growth, $W_g$. Assume that the labor market of the country is efficient, so that a higher wage means higher productivity, at least in the field of data science. Wage growth is coded so that growth of 7\% would be 0.07. \\

	\begin{figure}[ht!]
		\centering
		\begin{tabular}{p{5cm}|ccc}
			& \multicolumn{3}{c}{Dependent variable}\\
			\hline
			& $\underset{(1)}{ln(Assets)}$ & $\underset{(2)}{ln(Sales)}$ & $\underset{(3)}{ln(Employment)}$\\
			\hline
			$D_h$ & $\underset{(0.027)}{0.089}$ & $\underset{(0.026)}{-0.131}$ & $\underset{(0.015)}{-0.108}$\\[5pt]
			$D_h \times W_g$ & $\underset{(0.18)}{1.21}$ & $\underset{(0.17)}{0.94}$ & $\underset{(0.11)}{0.37}$ \\
			$W_g$ & $\underset{(0.28)}{0.58}$ & $\underset{(0.26)}{0.29}$ & $\underset{(0.22)}{0.21}$ \\
			\hline
		\end{tabular}
	\end{figure}
where $D_h$ is a dummy for AI grads and $W_g$ is the wage growth rate for AI grads. Standard errors are displayed in parentheses. All regressions also contain a constant term.
	\vspace{0.2cm}	

	
	\begin{enumerate}
		\item[1.] Explain why a simple regression of business outcomes on the AI-training alone may not answer the question data scientists are interested in. (5 marks)\\
		
		\item[2.] Explain how the use of wage growth of AI grads may circumvent the problem you described in 1). What's the interaction term's function in words? (5 marks)\\ 
		
		\item[3.] Explain verbally what the coefficient of 0.089 on the dummy for AI-grads in column (1) means. (5 marks) \\
		
\item[4.] If wage growth is 10 percentage points higher, how much higher are the sales of non-AI-trained companies in the sample on average? Explain whether this effect is statistically different from zero. (5 marks) \\
		
		\item[5.] What do you conclude from the results in the table about the effect of AI training on AI adopted company's outcomes? (5 marks) \\
		
\item[6.] Suppose you also have data for assets, sales, and employment in these companies in 2060. Suppose you were to run regressions with these dependent variables analogous to the regressions in the table above. Explain how the new regressions would help you interpret the results above. \\
		
\item[7.] As more and more AI grads flow into the labor market, given the growing competition for top minds, ranking services have been introduced to the market. The ranking service claims to have differentiated AI education's quality into low-tier programs like Engineering ($E$) and high-tier ones like Science ($S$) in the year 2055. Companies pay more to high-tier grads, so the wage growth rates are now $WE_g$ and $WS_g$ for Engineering and Science, respectively. How does this change affect your analysis in 6)? (5 marks) \\
		
		\item[8.] Given the change of regime in 2055, you would like to see whether the split of the program helped companies. How do you formulate your data scientific test?  (5 marks)\\
		
\item[9.] You have an engineering-background boss whose understanding of data science is no better than a collection of Gitjjab codes. He claims that deep learning can solve every data science problem, so that no human logic is needed. He adds that your argument does not rely on 'the most recent and advanced deep-learning practices done by top-notch companies and researchers'. Provide your rebuttal. (10 marks)\\
		
\item[Bonus.] Assume that you are in the class of 2055-2056 at an engineering program. Back then, you were misguided by the engineering school's marketing that guaranteed a 100\% graduation rate and employment rate. In addition, you failed the admission exam to SIAI, one of the most well-known Science-tier AI programs in the world. Back then you were scared, but after years on the job, you realized that you had wasted your money and time. Now, given 8), you have a strong temptation to go back to school and re-try the Science tier. If successful, you can enjoy a higher wage and better appreciation from the market. Given your personal estimate of the success rate, formulate your argument. (10 marks)
	\end{enumerate}
\end{question}


\end{document}

Even when I solve all the exam problems in advance, students still struggle to adapt to the exam itself, so starting with the F2023 cohort we ran a mock exam: a practice test, not counted toward grades, taken instead of jumping straight into the real thing.

I had made a batch of fun problems but figured I should save them for the real exam in early January, and since a 'fun' problem that feels too hard would defeat the purpose, I compromised: I changed only the setting and kept the mathematical structure identical to last year's. I hope it becomes an occasion for students to realize that once you know the frame, you can apply it anywhere, however the situation changes.

While renewing the whole company website, we have been experimenting with ways of sharing code, so I posted the LaTeX source of the exam above. Once the code-display design is polished, I am thinking of managing company code not on external services like GitHub or BitBucket but on an internal, private knowledge base.

As with all educational material shared through PDSI, the exam inside the LaTeX code above may be used freely as long as the source is credited and the use is not for monetary gain.

Once the internal website renewal settles down, I plan to set up a separate SIAI Korea webpage, migrate some of the content we have run as Pabii Class, and let our SIAI students use Korean versions of overseas educational materials and of our own SIAI materials as portfolios to show off, with code shared the way it is in the example above. Just translating the textbooks I wrote is a daunting amount of work...

I have pushed and pressed in various ways, but self-production of content seems hard for everyone, and it looks like a losing battle, so I have let go of much of my ambition. Anything would get done quickly if I did it myself, but I should instead find work students can actually do and let them turn it into their own portfolios.

At least, as with the foreign-article translations and paper introductions done through GIAI R&D Korea, they should be able to write knowledge-base posts sharing AI/DS educational materials from SIAI and from prestigious universities abroad as they understood them, right? That much you should do, if the money, time, energy, and passion you poured into studying is to pay off, no?

I will spend money, time, energy, and passion on a website design that prints (multi-)author names in large type, so use it to show off how well you were educated ^^

Only with plenty of people who can produce this in English will the quality of the school's education, and the graduates' own self-promotion, carry beyond the Korean-speaking sphere into the global market... but this, too, seems to be a losing battle in Korea.

Anyway, problem 1 above asks students to set up a hypothesis that verifies (refutes? disentangles?) the mathematician's notion of necessary and sufficient conditions through simultaneity, one of the foundations of statistics. Since all SIAI exams are based on real cases, I brought in the example of an internal debate over how to operate and manage an AI/DS community (it actually happened inside a certain overseas company), and structured the problem around

  • reality -> abstraction -> extraction of the basic mathematical principle -> statistical verification -> choice of the appropriate computational tool

so that it tests this canonical chain of data-science knowledge application and reasoning.
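
For reference, here is a sketch of the reduced form asked for in part 1 (my own working from equations (1) and (2) above, not part of the exam handout). Substituting (2) into (1):

\begin{align*}
	C &= \alpha_1 + \alpha_2 (\beta_1 + \beta_2 C + v) + \alpha_3 M_2 + u \\
	(1 - \alpha_2 \beta_2)\, C &= (\alpha_1 + \alpha_2 \beta_1) + \alpha_3 M_2 + (u + \alpha_2 v) \\
	C &= \frac{\alpha_1 + \alpha_2 \beta_1}{1 - \alpha_2 \beta_2} + \frac{\alpha_3}{1 - \alpha_2 \beta_2} M_2 + \frac{u + \alpha_2 v}{1 - \alpha_2 \beta_2}
\end{align*}

which is only defined when $\alpha_2 \beta_2 \neq 1$, exactly the special case flagged in part 2; it also shows why OLS on equation (2) is inconsistent, since $C$ carries the disturbance $v$ through the feedback loop.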

Originally, I had turned the claim that the number of users grows exponentially due to the network effect into the claim that a User^2 (or even higher power) variable is needed, then into the claim that 'needing a non-linear term' = 'SVM, tree, and DNN-style models will beat regression', and built questions about what steps a computational-statistics test of that claim would take. Since this was a mock exam, I just dropped them.

Having given away this spoiler, the official January exam will carry a version reworked to be more fun (for me alone), haha.


Can an AI master's or doctoral program actually help increase wages?
By Keith Lee, Head of GIAI Korea and Professor of AI/Data Science @ SIAI
Korean conglomerates' pay scales convert degrees into years of service, so the model becomes meaningless
Under the overseas approach, pay can be analyzed with per-degree dummy variables combined with wage growth rates
Naive 'AI computation' without dummy variables can yield wrong conclusions

I usually hide the fact that I studied a lot when I join a new group, but a moment inevitably comes when I have to say something that reveals expertise, and once I confess that my schooling ran longer than most people's, there is a question I always get. People can sense within a short conversation that my level of knowledge is high; what they ask is whether the market actually prices that value higher.

When asked, I usually answer that in Korea one seems to be sold on 'name value' alone, while abroad there is a very thorough evaluation of whether you really studied more, know more, and can therefore be put to use in corporate work. Looking back on my years in Korea, I have often seen people intimidated by the mere fact that someone attended a certain university's graduate school, assuming they must know a lot, or conversely turning jealous; but I do not think I was ever asked a single in-depth question meant to assess my actual ability.

The typical Korean way of setting pay

The reason I bring up Korea's unfortunate reality is that for degree education to actually help raise wages, evaluation has to work the way it does abroad. Suppose we build a data-driven model to judge whether a degree actually helps raise wages. Imagine, say, a young company that has grown and now wants to recruit highly educated talent: it has the vague sense that their pay should be set at a different level from the people it has hired so far, but only the most superficial figures on what it should actually offer. That is the situation where this model is worth thinking about.

In Korea, companies usually stop at gathering comparison data: how much conglomerates in the same industry pay, or how much foreign companies operating in Korea pay their local staff. Rather than judging concretely what someone studied during the degree and how that helps the company, they build a 'pay table' set by crude buckets: 'PhD holder', 'master's holder', 'educated abroad', 'educated domestically'.

The conglomerate pay structures I have seen in Korea fix the degree track at two years for a master's and five for a PhD, and apply the salary table as if those years had been spent at the company. For example, if a student who entered an integrated master's-PhD program at Harvard right after graduating from Seoul National University finishes after six hard years and joins a Korean conglomerate, the HR team applies five years to the PhD and slots the hire into the pay band of a sixth-year employee. Being from a prestigious school may bring promises of extra pay through bonuses and the like, but since the conglomerates' 'pay table' structure has stayed unchanged for decades, absorption into the system as a sixth-year employee is unavoidable.

I often get the absurd question of whether one could just gather 100 people each with bachelor's, master's, and doctoral degrees, find out their salaries, and run an 'AI' analysis. If the case above is accurate, then no matter what computation you use, salaries are set by the conglomerates' years-of-service system, and no analysis will conclude that the degree helps. If anything, in fields where so much extra study is needed that finishing in five years is impossible and even a six-year graduation earns praise, the Korean-conglomerate arithmetic works badly against the degree holder, and you might even conclude that getting a degree cuts your salary.

The harm done by crude pay-setting

Now imagine a very smart person who knows all this. Someone with exceptional ability is unlikely to settle for pay fixed by a table, so they may have no interest in such a conglomerate in the first place. Companies hunting for talent in key technology industries like AI and semiconductors are thus stuck agonizing over pay, because they risk the hiring failure of taking on people who have degrees but no ability.

In fact, labs run by some passionate professors at S University operate in the overseas style, graduating students only once they write a good dissertation, however many years it takes, and they draw heavy criticism from students who want jobs at Korean conglomerates. You can find all kinds of attacks on those professors on sites like 김박사넷, which collects evaluations of domestic researchers. The conglomerates' crude years-of-service arithmetic is blocking the growth of proper researchers.

In the end, because of a pay structure born of convenience at companies lacking the capacity for complex judgment, hiring converges on people who finished their degrees in the conventional two or five years, with the quality of their dissertations ignored.

Overseas pay models that price competence

Leaving Korea's frustrating reality behind, let us build the data analysis around the pay-setting of countries where degrees are actually earned by competence and can serve as a credible indicator of it.

First, as an explanatory variable, consider a dummy variable for whether one holds the degree. Next, the wage growth rate becomes another important variable, since growth rates can differ by degree. Finally, to capture the relationship between the degree dummy and wage growth, add the product of the two as a variable. With this last, interaction variable you can distinguish wage growth without the degree from wage growth with it. If you want to separate master's and doctoral degrees, set two dummy variables and add their products with the wage growth rate as well.

What if you want to distinguish AI-related degrees from the rest? Add a dummy for having an AI-related degree and, in the same way, another variable multiplying it with the wage growth rate. Naturally this need not be limited to AI; you can swap in all sorts of other distinctions.

A question that comes up here: schools differ in reputation, and their graduates' actual skills surely vary, so can we tell them apart? As with the AI-degree condition above, just add one more dummy variable at a time: for example, whether one graduated from a top-5 university, or whether one's dissertation was published in a top journal.

Why do you think this computation can't be used in Korea?

The biggest reason the overseas pay model above is hard to apply in Korea is that in Korean corporate culture the research methodology of advanced degree programs is almost never actually used, and its value almost never turns into company profit.

Recently the government announced large cuts to national R&D subsidies. Under the 2024 budget proposal, R&D spending will fall by roughly 20% from 2023. Given that the budget ballooned from 20 trillion won in 2017 to no less than 30 trillion won by 2023, this is really a partial correction of past over-payment, but people in the field evidently feel otherwise.

The thing is, most of the research done by those complaining involves no statistical training at all. One of the retorts I hear endlessly from the many engineering graduates who refuse basic statistics training after returning to Korea is that with 'AI computation' you should not need statistical data transformations. Let me give an example.

'With AI computation, don't you get to skip making dummy variables?'

If, in the example above, you skip creating dummy variables and just mark the column as a categorical variable, the computer code in fact converts the categories into dummy variables internally; in machine learning this step is called one-hot encoding. But if 'bachelor's-master's-PhD' is coded as '1-2-3' or '0-1-2', the salary weight of a PhD relative to a master's is forced to 1.5 (the 2:3 ratio) or 2 (the 1:2 ratio), which is an error. The master's and PhD must enter as independent variables so that each degree's salary effect can be separated. With the wrong weights, a '0-1-2' coding can produce the conclusion that a PhD's salary growth is barely half a master's, and a '1-2-3' coding likewise under-evaluates the master's and PhD salary effects by 50% and 67% relative to their true size.
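
A minimal sketch of the difference between the two codings in Python (toy data with made-up salary figures, purely for illustration):

import numpy as np
import pandas as pd

# Toy data: degree level and salary in millions of won (made-up numbers)
df = pd.DataFrame({
    "degree": ["BA", "MS", "PhD", "MS", "PhD", "BA"],
    "salary": [50, 62, 85, 60, 88, 48],
})

# Ordinal coding ('0-1-2'): a single slope forces the PhD effect to be
# exactly twice the MS effect, whatever the data actually say.
ordinal = df["degree"].map({"BA": 0, "MS": 1, "PhD": 2})
X_ord = np.column_stack([np.ones(len(df)), ordinal])
b_ord = np.linalg.lstsq(X_ord, df["salary"], rcond=None)[0]

# Dummy coding (one-hot, baseline BA): MS and PhD each get their own
# coefficient, so the two salary effects are estimated separately.
dummies = pd.get_dummies(df["degree"], drop_first=True).astype(float)
X_dum = np.column_stack([np.ones(len(df)), dummies["MS"], dummies["PhD"]])
b_dum = np.linalg.lstsq(X_dum, df["salary"], rcond=None)[0]

print("ordinal slope:", b_ord[1])               # one forced slope for all degrees
print("MS, PhD effects:", b_dum[1], b_dum[2])   # separate, unconstrained effects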

Since 'AI computations' are essentially non-linear versions of statistical regression, it is very rarely possible to skip the data preprocessing that regression requires to separate each variable's effect. The data libraries widely used in entry-level languages like Python do not consider every such case and hand a non-specialist the right conclusion for each dataset's situation.

Even without naming particular media articles or the papers they cite, you have probably seen claims that degree programs do not help salaries much. Whenever I read such papers, I go through the exercise of checking for basic errors like the one above. Unfortunately, papers in Korea that pay meticulous attention to variable selection and transformation are hard to find.

Reaching wrong conclusions from a poor grasp of variable selection, separation, and cleaning is not unique to Korean engineering graduates. I once heard that Amazon, when hiring developers, used the byte length of code posted to GitHub, one of the platforms where developers share code, as one of its variables; rather than a good measure of ability, it strikes me at best as a gauge of how much care someone took to present themselves well.

Very many engineering students copy-paste code from similar-looking cases found via Google search and call it data analysis. In IT development that may often cause no great harm, but where data transformation tailored to the research question is essential, as in the case above, at least undergraduate-level statistics is indispensable, so let us try to avoid amassing high-grade data only to reach wrong conclusions through wrong analysis.


[Notice] November 2023 The Economy Korea applicant assignment
By The Economy Administrator (https://economy.ac)

Below is the email we send to applicants. Starting with November applications, instead of emailing applicants directly, we publish the assignment openly and accept only the answers.


Hello,

Thank you for your interest in The Economy Korea. Please send the short assignment below to [email protected], and we will get back to you after internal review.

About the assignment

Short assignment: rewrite the press-release-based article below according to the requests that follow.

*Note: you must write an article, and a high-end one at that. We are not looking for report-style writing that is not an article, or for articles that lack analytical power.

Background: examples of our in-house articles

Articles that add the analytical angles a press release, which merely relays information, tends to omit.

For the assignment, please work in the press release -> in-house article direction. To help your understanding, we also include follow-up articles. Internally, a managing editor extracts the 2-3 angles that become subheads, and there are an editor and an infographic designer. The purpose of this assignment is to judge whether you can drive an argument along those angles.

Article-writing guide

Press release summary

ㄴPress release link: Half of new Disney+ subscribers choose the ad-supported plan - ZDNet Korea
ㄴLead-in: So that many people now use the ad-supported plan. As prices fall, demand is bound to rise, but lowering prices also cuts margins, so Disney must be worried. Advertising has to cover the revenue shortfall, and in an era when targeting ads with user data has become illegal, will the ads even be profitable...

Write a lead-in that summarizes the whole article in about two short paragraphs.

How to write a press release - Newswire (newswire.co.kr): see the explanation under section 6, the lead paragraph.

Talking Points
1. A price increase was announced, yet people say it is too expensive and pick the ad-supported plan anyway
Disney+ raises subscription fees and bans account sharing from next month: gain or loss? - Financial News (fnnews.com)
"The ad-supported plan is the better deal": why Disney+ won't budge despite stalled subscriber growth - ITWorld Korea
Apparently that's how people think

2. The problem is that making money from ads is not easy, and people open their wallets less than you'd think
94% of US Disney+ subscribers who changed plans say they'll 'pay more, skip the ads' (mk.co.kr)
ㄴThey say that, yet half of new sign-ups chose the ad plan? So existing subscribers hate the ad plan, while new subscribers joined precisely because it is cheap?
Do OTT viewers have to watch ads now?... After Netflix, Amazon introduces an ad plan - Hankyoreh (hani.co.kr)
As a kind of budget tier it succeeded in pulling in subscribers with lower purchasing power, but we should ask how much money was actually made showing ads to those subscribers,

Disney: revenue and subscriber numbers both fall, while theme parks grow - ZDNet Korea
Ad revenue is in fact growing. If it keeps growing steadily, the ad plan need not be a loss.

3. Korean OTTs with stagnant subscriber numbers will be mulling this too
Tving and Disney agonize over profitability: a relay of ad-plan launches and price hikes - Chosun Biz (chosun.com)
ㄴCoupang Play is 4,900 won; everywhere else starts above 10,000 won...
More Korean OTT operators will be wondering whether launching a cheap ad-supported tier could lift profitability.

Google raises the price of YouTube Premium - ZDNet Korea
YouTube Premium is $14 and plenty of people pay for it, while the number watching with ads is enormous, haha.

Notes on the guide

Turn each talking point into one subhead and convert the original press release into an article with a lead-in and 3-4 subheads. Under each subhead, develop the argument over roughly three paragraphs, filling in what the press release lacks. This is how we actually work.

Most applicants cannot do this job of quickly reading and digesting the points we hand over and turning the press release into a high-end article with added information. The reasons fall into two groups:

  • 1. not understanding the content, and
  • 2. being unable to write in article form.

Most cannot write the article at all because they do not understand the content, and fail to understand it even with time and effort, so we built this test to check for the ability to understand quickly.

Also, since the output must be an article rather than a blog post, whether you can write in a journalistic register is under review as well.

Most problems arise at point 1, to the bafflement of the reader, and lately applications with problems at point 2 have also increased sharply. Please read a few of our outlets' articles and pay extra attention to point 2.

Pace of actual work

Once real work starts, people initially spend 3-4 hours per article, but the time gradually falls to under 2 hours. Fast writers produce one article every 20-30 minutes.

We used to pay hourly; since the system settled, we pay per piece. Base pay is 25,000 won per article, but since we only publish articles that meet the quality bar, in practice it runs at 30,000 won with a 5,000-won top-up.

Send your article in Word or HWP format and we will reply after review.

We wish you the best of luck.

Thank you,

The Economy Korea operations team


Is an online degree inferior to an offline degree?
By Keith Lee, Head of GIAI Korea and Professor of AI/Data Science @ SIAI


Not the quality of teaching, but the way programs are operated
Easier admission and graduation bars are applied to online degrees
Studies show that higher quality attracts more passion from students

Although much of the prejudice against online courses faded during the COVID-19 period, there is still a strong belief that online education is of lower quality than offline education. Teaching online myself, I find little difference in the lecture content between recording a video and lecturing in person, but there is a gap in communication with students, and unless a new video is recorded each time, it is hard to update past material. Those are real limitations.

On the other hand, I often get the response that having videos is much better because students can listen to the lectures repeatedly. Since the course I teach is an artificial intelligence course based on mathematics and statistics, students who forget or never knew a mathematical term or statistical theory often replay the video several times and look up related concepts in textbooks or via Google searches. The prejudice says the level of online education is lower, but precisely because it is online and can be replayed, advanced concepts can be taught more confidently in class. That is an advantage.

Is online inferior to offline?

While running a degree program online, I have wondered why the general prejudice about the gap between offline and online exists. The conclusion I have reached from experience so far is that although the lecture content is the same, the operating method is different. How on earth is it different?

The biggest difference is that, unlike offline universities, universities running online degree programs rarely establish fiercely competitive admissions and often leave the door wide open. Online education is perceived as a supplement to a degree program, or as a way to fill required credits; it is extremely rare to run an online degree so demanding that it is perceived as the difficult challenge a professional degree should be.

Another difference lies in the interactions between professors and students, and among students. Pursuing a graduate degree in a major overseas city such as London or Boston, the time and money spent staying there was a disadvantage, but the bond and intimacy with fellow students was built very densely. Such intimacy goes beyond knowing faces and becoming friends on social media: there was the shared experience of going over exam questions and difficult material during the degree, and of untangling frustrating issues while writing a thesis. That is probably where the sense that offline education is more valuable comes from.

Korea's Open University and major overseas online universities put a great deal of effort into creating common points of contact among students, holding exams on-site instead of online and arranging study groups, precisely to solve this problem of bonding and intimacy.

The conclusion I finally reached after looking at these cases is that online universities so far have lacked the demanding admissions, the difficulty of content, the effort required to keep up, and the similar level of understanding among enrolled students that offline universities have, and that this, rather than the platform itself, is what separates offline from online.

Would making up for the gap with an online degree make a difference?

First of all, I raised the level of education beyond anything found in domestic universities. Most of the lecture content built on what I, and friends around me, had been taught at prestigious global universities, and the exam questions were raised to a level that even students at those universities would find challenging. Many students from prestigious domestic universities, including some with domestic master's or doctoral degrees, assumed it would be a light degree because it was an online university, then ran away in shock. Once it became known that the school was an online university, there was quite a stir in the English-speaking community as well.

I have definitely had the experience that once you raise the difficulty of the education, the casual 'it's just online' attitude largely disappears. So, in terms of student achievement, can there be a significant difference between online and offline?

[Table: test-score gap between online and offline students under specifications (1) OLS through (5) IV. Source: Swiss Institute of Artificial Intelligence]

The table above is excerpted from a study of whether the test-score gap between students who took classes online and students who took them offline was significant. Our school has never run offline lectures, but a similar conclusion emerged from the difference in grades between students who frequently visited in person to ask questions and those who did not.

First, in (1) – OLS, students who took classes online received scores about 4.91 points lower than students who took them offline. This simple analysis controls for nothing: the students' levels may differ, some students may simply not have studied, and so on, so its accuracy is very low. If students who only take classes online skip school out of laziness, their lack of passion for learning feeds directly into their test scores, and the naive estimate absorbs it.

To address this, in (2) – IV, the distance between the offline classroom and each student's residence was used as an instrumental variable to strip out the external factor of laziness: the shorter the distance, the easier it is to attend offline classes. Even after removing external factors with this variable, online students' test scores were still 2.08 points lower. Looking only at this, one would conclude that online education lowers students' academic achievement.
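
As a rough illustration of what such a single-instrument IV estimate does (synthetic data and hypothetical variable names; this does not reproduce the study's actual data or specification), the IV slope is just the ratio of covariances Cov(z, y)/Cov(z, x):

import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Synthetic data: 'online' is endogenous because unobserved laziness pushes
# students online AND lowers scores; 'distance' to the classroom shifts
# online attendance but has no direct effect on scores.
laziness = rng.normal(size=n)
distance = rng.normal(size=n)
online = (0.8 * distance + laziness + rng.normal(size=n) > 0).astype(float)
score = 70.0 - 2.0 * online - 3.0 * laziness + rng.normal(size=n)

# Naive OLS slope Cov(x, y) / Var(x): absorbs laziness, overstates the harm.
ols = np.cov(online, score)[0, 1] / np.var(online, ddof=1)

# IV slope with one instrument: Cov(z, y) / Cov(z, x).
iv = np.cov(distance, score)[0, 1] / np.cov(distance, online)[0, 1]

print(f"OLS: {ols:.2f}, IV: {iv:.2f} (true online effect: -2.0)")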

A question remained, though: could something capture students' passion for studying better than mere distance? Among various candidates, the number of library visits looked like a suitable proxy for passion, on the expectation that passionate students visit the library more actively. The specification in (3) – IV showed that students who diligently attended the library scored 0.91 points higher, and the score penalty attributed to online education shrank to only 0.56 points.

Another question arises here: how close is the library to the students' residences? Just as proximity to the offline classroom was used as a key variable, proximity to the library likely affected the number of library visits.

So in (4) – IV, the analysis was restricted to students assigned to dormitories by random draw. After confirming, via the correlation between classroom distance and test scores, that distance had no direct effect on scores within this group, the frequency of library visits among these students was used to recalculate the test-score gap from taking online courses.

As shown in (5) – IV, with the distance variable completely removed, visiting the library raised test scores by 2.09 points, and taking online courses actually raised test scores by 6.09 points.

As the example shows, the naive analysis in (1) leads to the misleading conclusion that online lectures lower students' academic achievement, while the calculation in (5), after disentangling the variables, shows that students who made good use of the lectures achieved more.

This is consistent with actual teaching experience: students who do not just play a video lecture once, but take it repeatedly and keep looking up related material, reach higher achievement. In particular, students who repeated sections and paused dozens of times during playback clearly outperformed students who mostly skipped through the lecture. After removing the effects of variables such as study-group membership, the average score and score distribution of fellow study-group members, and academic background before entering the degree program, the video-viewing pattern was associated not with a mere 2-point or 5-point gap, but with a difference large enough to determine pass or fail.

Not because it is online, but because of differences in students’ attitudes and school management

What can confidently be concluded from actual data and various studies is that there is no platform-based reason for online education to be undervalued relative to offline education. The gap exists because universities have run online programs as money-making lifelong-education centers, and because online education has been run so lightly for decades that students approach it with prejudice.

In fact, by providing high-quality education and organizing the program so that students who did not study passionately naturally failed, the gap with offline programs shrank greatly, and the student's own passion emerged as the most important determinant of academic achievement.

Nevertheless, fully non-face-to-face education does little to build the bond between professors and students, and because professors cannot make eye contact with individual students, it is hard to predict their academic progress. Asian students in particular rarely ask questions, so when there are no questions I have found it genuinely difficult to gauge whether students are really following along.

A supplementary system would likely include periodic quizzes and careful grading of assignments; if the online lecture is held live, calling on students by name and asking them questions is also a good idea.


Can a graduate degree program in artificial intelligence actually help increase wages?
By Keith Lee, Head of GIAI Korea and Professor of AI/Data Science @ SIAI


Asian companies convert degrees into years of work experience
Without extra evaluation of its value, an AI degree does not help salary much
'Dummification' of variables is required to avoid wrong conclusions

In every new group I hide the fact that I studied up to a PhD, but a moment always comes when I have no choice but to make a professional remark, and when I end up revealing that I have spent more years in school than most, I always get the same question. People sense from a brief conversation that my level of knowledge is high; what they ask is whether the market actually prices that value more highly.

When asked, my usual answer is that in Asia you seem to be sold on 'name value' alone, while in the western hemisphere there is a very thorough evaluation of whether you actually studied more, know more, and are therefore more usable in corporate work.


Typical Asian companies

I have met many Asian companies, but I have hardly ever seen one with a reasonable internal standard for measuring ability beyond counting years of schooling as years of work experience. Given that some degrees take far more effort and skill than others, you can see how rigid the Asian style is, and how badly it can misrepresent true ability.

For degree education to actually help increase wages, a decent evaluation model is required. Let's assume we are building a data-based model to determine whether an AI degree actually helps increase wages. Picture a young company that has grown a bit and now wants to recruit highly educated talent: it has the vague sense that their salary should be set at a different level from the people it has hired so far, but only very superficial figures on what it should actually offer. That is the situation worth considering.

Asian companies usually stop at comparative information, such as how much large corporations in the same industry are paying. Rather than judging specifically what was studied during the degree program and how it helps the company, the salary is determined by simple sorting into PhD, master's, or bachelor's. Since most Asian universities have lower standards in grad school, companies also separate graduate degrees into US/Europe versus Asia. They create a salary table for each bucket, place employees into the table, and that is how pay gets set.

The salary structures of the large Asian companies I have seen fix the degree track at 2 years for a master's and 5 years for a doctorate, and apply the salary table as if those years had been worked at the company. For example, if a student who entered an integrated master's-and-PhD program at Harvard right after graduating from an Asian university finishes after 6 hard years and joins an Asian company, the HR team applies 5 years to the doctorate and computes the salary band at the level of an employee with 5 years of experience. Graduating from a prestigious university may bring promises of more pay through bonuses and the like, but as the 'salary table' structure of Asian companies has stayed unchanged for decades, there is no avoiding the PhD holder from a prestigious university being absorbed into the system as just another employee with several years of experience.

I get a lot of absurd questions about whether one could find out by simply gathering 100 people each with bachelor's, master's, and doctoral degrees, collecting their salaries, and performing 'artificial intelligence' analysis. If the case above is true, then no matter what computation is used, be it a recent, highly compute-hungry method or simple linear regression, as long as salary is set by annualizing the degree, the analysis will not conclude that the degree program helps. Some PhD programs require over 6 years of study, yet your salary at an Asian company will look just like that of an employee with 5 years of experience after a bachelor's.

Harmful effects of a simple salary calculation method

Let's imagine a very smart person who knows this situation. A talented person with exceptional capabilities is unlikely to settle for the salary fixed by the table, so they may simply have no interest in the large company. Companies looking for talent in major technology industries such as artificial intelligence and semiconductors are thus bound to agonize over salary, because they risk the hiring failure of taking on people who have degrees but no skills.

In fact, research labs run by some passionate professors at Seoul National University operate in the western style: students graduate only once they write a decent dissertation, regardless of how many years it takes. This draws heavy criticism from students who want jobs at Korean companies; you can find all sorts of attacks on those professors on websites such as Dr. Kim's Net (김박사넷), which compiles evaluations of domestic researchers. Simple annualization is blocking the growth of proper researchers.

In the end, because of a salary structure created for convenience by Asian companies lacking the capacity for complex judgment, the people they hire are mainly those who finished a degree in the conventional 2 or 5 years, with the quality of the thesis ignored.

A salary model where pay is set by competency

Let's step away from the frustrating Asian cases and suppose degrees are earned by competency. We can then build the data analysis to the western standard, where the degree can serve as a credible indicator of competency.

First, consider a dummy variable for holding the degree as an explanatory variable. Next, the salary growth rate becomes another important variable, since growth rates may differ by degree. Lastly, to capture the relationship between the degree dummy and the salary growth rate, add the product of the two as a variable. This last, interaction variable lets us distinguish salary growth without the degree from salary growth with it. To separate master's and doctoral degrees, set two dummy variables and add each one's product with the salary growth rate as well.
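
A minimal sketch of that specification in Python (synthetic data; the names D for the degree dummy and Wg for the wage growth rate follow the article, everything else is made up for illustration):

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(42)
n = 500

# Synthetic sample: D = degree dummy, Wg = wage growth rate (7% -> 0.07)
D = rng.integers(0, 2, n)
Wg = 0.03 + 0.02 * D + 0.01 * rng.normal(size=n)
ln_salary = 3.0 + 0.10 * D + 0.50 * Wg + 1.20 * D * Wg + 0.05 * rng.normal(size=n)
df = pd.DataFrame({"ln_salary": ln_salary, "D": D, "Wg": Wg})

# 'D:Wg' is the interaction term: it lets wage growth carry a different
# slope for degree holders (D = 1) than for non-holders (D = 0), which is
# exactly what separates 'growth with a degree' from 'growth without one'.
fit = smf.ols("ln_salary ~ D + Wg + D:Wg", data=df).fit()
print(fit.params)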

What if you want to distinguish those who hold an AI-related degree from those who do not? Just add a dummy variable for the AI-related degree and, in the same manner as above, another variable multiplying it by the salary growth rate. Of course this need not be limited to AI; all sorts of other distinctions can be swapped in.

One question that arises here: every school has a different reputation and its graduates' actual abilities probably differ, so is there a way to tell them apart? Just as with the AI-degree condition above, add one more dummy variable at a time. For example, you can create dummies for whether someone graduated from a top-5 university, or whether the thesis was published in a high-quality journal.

If you use the ‘artificial intelligence calculation method’, isn’t there a need to create dummy variables?

The biggest reason the overseas-standard salary model above is difficult to apply in Asia is that it is extremely rare for the research methodology of advanced degree programs to actually be used in corporate work, and even rarer for its value to translate into company profit.

In the example above, when data analysis is performed by simply designating a categorical variable without creating dummy variables, the computer code in fact transforms the categories into dummy variables internally; in machine learning this step is called one-hot encoding. However, when 'bachelor's - master's - doctorate' is coded as '1-2-3' or '0-1-2', the salary weight of a doctorate relative to a master's is forced to 1.5 times (the 2:3 ratio) or 2 times (the 1:2 ratio), which is an error. The master's and doctoral degrees must instead enter as independent variables so that each degree's salary effect can be separated. With the wrong weights, a '0-1-2' coding may lead to the conclusion that the salary increase from a doctorate is only about half that of a master's, and a '1-2-3' coding likewise understates the salary effects of the master's and the doctorate by roughly 50% and 67% relative to their true effects.

Since 'artificial intelligence calculation methods' are essentially non-linear versions of statistical regression, it is very rarely possible to skip the data preprocessing that regression requires to separate each variable's effect. The data libraries widely used in entry-level languages such as Python do not consider all of these cases and hand a non-specialist the right conclusion for each dataset's situation.

Even without pointing at specific media articles or the papers they cite, you have probably seen the claim that a degree program does not significantly raise salary. Whenever I read such papers, I go through the exercise of checking for basic errors like the ones above. Unfortunately, it is not easy to find papers in Asia that pay such meticulous attention to variable selection and transformation.

Wrong conclusions born of a poor grasp of variable selection, separation, and cleaning are not unique to Korean engineering graduates. I once heard that Amazon, while recruiting developers, used the byte length of code posted on GitHub, one of the platforms where developers share code, as one of its variables. Rather than a good variable for judging competency, I think it can at best be read as a measure of how much care someone took to present themselves well.

Many engineering students simply copy and paste code from similar-looking cases found through Google searches and claim to have done data analysis. In IT development, proceeding that way may cause no major problems, but in areas where data transformation tailored to the research topic is essential, as in the case above, statistical knowledge at least at the undergraduate level is indispensable. Let's try to avoid collecting high-grade data only to reach wrong conclusions through wrong analysis.


Did Hongdae's hip culture attract young people? Or did young people create the 'Hongdae style'?
By Keith Lee, Head of GIAI Korea and Professor of AI/Data Science @ SIAI


The relationship between a commercial district and the concentration of one generation of consumers is mostly not a one-way causal effect
Simultaneity often requires instrumental variables
Real-world cases, too, end up mis-specified due to endogeneity

When working on data-science projects, causality errors are a common issue. There are quite a few cases where the variable thought to be the cause was actually the result, and conversely where the variable thought to be the result was the cause. In data science this error is called 'simultaneity'. Research on it began in econometrics, where it is counted among the three major endogeneity errors, together with omitted variables and measurement error.

As a real-life example, let me bring in an SIAI MBA student's thesis. Judging that the commercial district in front of Hongik University in Korea attracted people in their 20s and 30s, the student hypothesized that by finding the main variables that attract young people, one could find the variables that make up a commercial district where the young gather. If the assumptions hold, future commercial-district analysts could easily borrow the model, and such analysis could serve not only prospective small-store owners but also areas like consumer-goods companies' promotional marketing and credit card companies' street marketing.

[Photo: Hongdae station in Seoul, Korea]

Simultaneity error

Unfortunately, however, it is not the commercial district in front of Hongdae that attracts people in their 20s and 30s; it is the cluster of schools, Hongik University and nearby Yonsei University, Ewha Womans University, and Sogang University, that attracts them, and Hongik University station is, in addition, one of Seoul's transportation hubs. The district thought to be the cause is actually the result, and the young people thought to be the result may be the cause. In cases of such simultaneity, running regression analysis or the recently popular non-linear regression models (e.g. deep learning, tree models) is likely to either exaggerate or understate the explanatory variables' influence.

Econometrics long ago introduced the concept of the 'instrumental variable' to handle such cases. It can serve as a data-preprocessing step that removes the problematic parts, including tangled causal relationships, in any of the three endogeneity situations. Data science, being a recently created field, has been borrowing methodologies from neighboring disciplines, but since this one's starting point is economics, it is unfamiliar to engineering majors.

In particular, people whose thinking was shaped by natural-science methodologies that demand perfect accuracy, such as mathematics and statistics, often dismiss instrumental variables as 'fake variables'. But the data of our reality comes with all sorts of errors and correlations, so the technique is an unavoidable part of research using real data.

From data preprocessing to instrumental variables

Returning to the commercial district in front of Hongik University, I asked the student: among the tangled causal relationship between the two, can you find a variable that is directly related to the endogenous variable (the relevance condition) but has no meaningful relationship with the other side (the orthogonality condition)? One can look for variables that affect the growth of the Hongdae commercial district but have no direct effect on the gathering of young people, or variables that directly affect the gathering of young people but are not directly related to the district.

First of all, the existence of nearby universities plays a decisive role in attracting people in their 20s and 30s. The easiest way to check whether the universities helped the youth population while remaining unrelated to the district itself would be to remove each school one by one and watch youth density; unfortunately, they cannot be separated individually. A more reasonable choice of instrument is to consider how the Hongdae district functioned during the COVID-19 period, when the number of students visiting the school area plummeted under non-face-to-face study.

It is also worth comparing the areas in front of Hongik University and Sinchon Station (one station to the east and another symbol of hipster culture) to isolate the character of the stores that make up the district, despite their shared traits such as transportation hubs and heavy student crowds. Since the general perception is that the Hongdae district is full of unique stores found nowhere else, the number of unique stores can be used as a variable that separates the tangled causal relationship.

How does the actual calculation work?

The most frustrating habit among engineers so far has been the calculation style of inserting every variable and all the data in the blind faith that 'artificial intelligence' will automatically find the answer. One such method, 'stepwise regression', repeatedly inserts and removes assorted variables; despite the statistical community's warnings that it must be used with caution, I have too often seen engineers without proper statistics education use it haphazardly and without thinking.

As pointed out above, when linear or non-linear regression is computed without first removing the 'simultaneity error' that encodes a tangled causal relationship, the effects of variables inevitably get over- or understated. In such cases, data preprocessing must come first.

Data preprocessing with instrumental variables is called 'two-stage least squares (2SLS)' in data science. The first stage removes the tangled causal relationship and reduces it to a simple one; the second stage runs the ordinary linear or non-linear regression we know.

In the first, removal stage, the endogenous explanatory variable is regressed on the instrument or instruments chosen above. In the Hongdae example, the number of young people is the explanatory variable we want to use, and the university-related variables, related to young people but not expected to relate directly to the district, serve as instruments. Regressing the number of young people on a 0/1 indicator splitting the period before and after the COVID-19 pandemic extracts only the part of the youth population explained by the universities. Using the values extracted this way, the relationship between the Hongdae district and young people can be identified through a simple causal channel rather than the tangled one above.
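
A minimal numerical sketch of the two stages (synthetic data; 'z' stands in for the 0/1 COVID-period instrument discussed above, and all names are hypothetical):

import numpy as np

rng = np.random.default_rng(1)
n = 2000

# Synthetic district-month data: an unobserved feedback 'shock' hits both
# the number of young visitors (x) and district sales (y), creating
# simultaneity; the COVID indicator (z) moves x but not y directly.
shock = rng.normal(size=n)
z = rng.integers(0, 2, n).astype(float)          # 0/1 COVID-period instrument
x = 5.0 - 2.0 * z + shock + rng.normal(size=n)   # young visitors
y = 1.0 + 0.7 * x + shock + rng.normal(size=n)   # district sales

def ols(X, y):
    # least-squares coefficients for X @ b = y
    return np.linalg.lstsq(X, y, rcond=None)[0]

# Stage 1: regress the endogenous x on the instrument, keep the fitted values
Z = np.column_stack([np.ones(n), z])
x_hat = Z @ ols(Z, x)

# Stage 2: regress y on the stage-1 fitted values
b_2sls = ols(np.column_stack([np.ones(n), x_hat]), y)

# Naive OLS for comparison: the feedback inflates the slope
b_ols = ols(np.column_stack([np.ones(n), x]), y)

print(f"OLS slope: {b_ols[1]:.2f}, 2SLS slope: {b_2sls[1]:.2f} (true: 0.70)")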

Failure cases of actual companies in the field

Without the actual data it is hard to give more than a rough opinion, but judging from the 'simultaneity error' cases encountered so far, if all the data were simply thrown in without the 2SLS step and a linear or non-linear regression computed, a huge weight would land on the simple conclusion that the Hongdae district expanded because there are many young people, while monthly rents in nearby residential and commercial areas, the presence of unique stores, accessibility near subway and bus stops, and the rest would come out largely insignificant. The tangled interaction between the two variables would have stolen the explanatory power that should have gone to the others.

Some engineering graduates in Korea who never received proper education rely on tree models and deep learning in the same throw-everything-in spirit as stepwise analysis and call the result a 'conclusion found by artificial intelligence'. But the explanatory structure between the variables is the same; only the linearity or non-linearity differs, so the variables' explanatory power shifts somewhat while the conclusion stays just as wrong.

The case above matches, almost perfectly, the mistake made when a credit card company and a telecom jointly analyzed the commercial district in the Mapo-gu area. An official who took part in the study used the expression 'gathering young people is the answer' and, as expected, showed no understanding of the need for instrumental variables. He simply regarded data preprocessing as nothing more than discarding missing data.

In fact, the elements that make up not only Hongdae but all of Seoul's major commercial districts are very complex. Young people mostly gather because the district's intricate components jointly created an attractive result, and the answer cannot be found with simple 'artificial intelligence computations' like the above. In pointing out errors in the data analysis currently done in the market, I singled out the 'simultaneity error', but the work also involves omitted variable bias from missing key variables and attenuation bias from inaccurately measured ones; it demands fairly advanced modeling that weighs all these factors together.

We hope that students who have been exposed to incorrect machine learning, deep learning, and artificial intelligence education will learn the concepts above and become able to do rational, systematic modeling.



SNS heavy users have lower income?


One-variable analysis can lead to serious errors; you must always consider the complex relationships among multiple variables.
Data science is model research that uncovers complex relationships among many variables.
Obsessing over a single variable is an outdated way of thinking; the era of big data demands a better one.

Whether in data science talks, when employees come in with wrong conclusions, or when I give external lectures, the point I always emphasize is: do not do 'one-variable regression.'

The simplest examples range from conclusions with a backwards causal arrow, such as "whenever I buy a stock, it falls," to hasty single-cause conclusions, such as "women are paid less than men" or "immigrants are paid less than native citizens." The problem is not solved simply by applying a calculation method known as 'artificial intelligence'; you must have a rational thinking structure that can distinguish cause from effect in order to avoid falling into such errors.

Do SNS heavy users end up with lower wages?

Among the examples I have seen most recently, the common belief that heavy social media use causes your salary to fall keeps bothering me. If anything, skilled SNS use saves on promotional costs, so the salaries of professional SNS marketers should be higher. I cannot understand why a story that applies only to high school seniors cramming for exams is being applied to the salaries of ordinary office workers.

Salary is influenced by many factors: one's own ability, the degree to which the company uses that ability, the added value produced through it, and pay levels in similar occupations. If you set all those variables aside and run a 'one-variable regression analysis', you arrive at the hasty conclusion that you should quit social media if you want a high-paying job.

People may think, 'So does analyzing with artificial intelligence only produce wrong conclusions?'

Is it really so? Below is a structured analysis of this illusion.

Source=Swiss Institute of Artificial Intelligence

Problems with one-variable analysis

A total of five regression analyses were run, each adding one or two more of the variables listed on the left. The first variable is whether you use SNS; the second is the interaction of being female and using SNS; the third is whether you are female; the fourth is age; the fifth is the square of age; and the sixth is the number of friends on SNS.

The first regression, reported as (1), is a representative example of the one-variable regression mentioned above; its conclusion is that using SNS increases salary by 1%. A reader who saw this conclusion and recognized the problem of one-variable regression asked whether women who use SNS are paid less, since women use SNS relatively more. In (2), we therefore distinguished those who are female and use SNS from those who are not: the salaries of non-female SNS users were 11.8% higher, while, conversely, those of female SNS users were 18.2% lower.

Those who have read this far may be thinking, 'So discrimination against women really is this severe in Korean society.' Others may want to separate out whether salaries fell simply because the person is a woman, or because she uses SNS.

That calculation is performed in (3). Those who were not women but used SNS had salaries 13.8% higher, and those who were women and used SNS had salaries only 1.5% higher, while being a woman was associated with a 13.5% lower salary. The conclusion: being a woman who uses SNS is a variable with little meaning, while receiving a lower salary for being a woman is a very significant variable.

At this point the question may arise whether age is an important variable; when age was added in (4), it turned out not to be significant. The reason I also used the square of age is that people around me who wanted to study 'artificial intelligence' asked whether the 'artificial intelligence' calculation methods would make a difference. Data such as SNS use and male/female are simple 0/1 data, so the result cannot change regardless of the model used; age, however, is not a 0/1 variable, so its square was added to check whether there is a non-linear relationship between the explanatory variable and the outcome. 'Artificial intelligence' calculations are, after all, calculations that extract non-linear relationships as far as possible.

Even adding the non-linear term, the square of age, does not yield a significant variable. In other words, age has no direct effect on salary, either linearly or non-linearly.

Finally, when the number of SNS friends was added in (5), the conclusion was that only having a large number of friends lowered salary, by 5%, and that simply using SNS did not affect salary.

Through this step-by-step calculation, we can confirm that using SNS as such does not reduce salary; rather, using SNS very heavily and pouring attention into online friendships is what is associated with a lower salary, and even that accounts for only 5% of the total. The bigger problem, in fact, is the other aspect of the employment relationship that the gender variable reveals.
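For readers who want to see the mechanics, below is a minimal sketch of such a stepwise regression in Python. The DataFrame and its column names (log_wage, sns, female, age, n_sns_friends) are hypothetical, and the synthetic coefficients only echo the story above; they are not taken from the actual study.

```python
# Stepwise regressions in the spirit of specifications (1)-(5) above.
# All data here is synthetic and for illustration only.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 1000
female = rng.integers(0, 2, n)
sns = rng.integers(0, 2, n)
age = rng.integers(20, 60, n)
n_sns_friends = sns * rng.poisson(200, n)        # only SNS users have SNS friends
log_wage = (3.0 + 0.01 * sns - 0.135 * female
            - 0.0005 * n_sns_friends + rng.normal(0, 0.3, n))
df = pd.DataFrame({"log_wage": log_wage, "sns": sns, "female": female,
                   "age": age, "n_sns_friends": n_sns_friends})

specs = [
    "log_wage ~ sns",                                         # (1) one-variable
    "log_wage ~ sns + sns:female",                            # (2) split SNS use by sex
    "log_wage ~ sns + sns:female + female",                   # (3) add a female dummy
    "log_wage ~ sns + sns:female + female + age + I(age**2)", # (4) age, linear and squared
    "log_wage ~ sns + sns:female + female + age + I(age**2) + n_sns_friends",  # (5)
]
for i, formula in enumerate(specs, 1):
    fit = smf.ols(formula, data=df).fit()
    print(f"({i})", fit.params.round(3).to_dict())  # watch coefficients shift per step
```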

Numerous one-variable analyses encountered in everyday life

When I meet a friend at an investment bank, I sometimes hear the expression, 'The U.S. Federal Reserve raised interest rates, so stock prices plummeted,' and when I meet a friend in the VC industry, I hear, 'The VC industry is struggling these days because the number of fund-of-funds has decreased.'

On the one hand this is true, because central bank rate hikes and cuts in the supply of policy funds do have a significant impact on stock prices and market contraction. On the other hand, the conversation never makes clear how large that impact was, or whether the policy variable alone mattered while nothing else had any effect. It may not matter in chat between friends, but if the same one-variable analysis is used by those who make policy decisions, it is no longer a minor problem: assuming a simple causal relationship and choosing a remedy in a situation where numerous other factors must be weighed is bound to produce unexpected problems.

U.S. President Truman is said to have quipped that he hoped someday to meet a one-handed economist. The economists hired as economic advisors always offered an interpretation of event A with one hand, while at the same time offering interpretation B and the necessary policies with the other.

From a data science perspective, President Truman was requesting a one-variable analysis, and his consulting economists insisted on providing at least a two-variable analysis. And this does not happen only with President Truman: conversations with countless non-expert decision makers follow the same pattern, where they demand a one-variable answer and you worry about how to deliver the second variable more digestibly. Every time I run into this reality, I wish the decision maker were smarter and able to take multiple variables into account, and I catch myself thinking that if I were the decision maker, I would know more and make more rational choices.

Risks of one-variable analysis

It was about two years ago. A new representative from an outsourcing client came and asked me to explain a previously delivered model one more time. The model was a graph model based on network theory: it explained how the many words connected to a given keyword related to one another and how they were intertwined. Such a model is useful for reading public opinion through keyword analysis and for helping a company or organization devise an appropriate marketing strategy.

The new person in charge looked visibly displeased as he listened to the explanation and demanded a single number telling him whether sentiment around their main keyword was good or bad. I suggested an alternative: few words capture such likes and dislikes cleanly, but there is a variety of related words with which the person in charge can gauge the phenomenon, along with information identifying the relationships between those words and the key keywords, so those should be put to use.

He insisted to the end on being given that one number, so I explained that if we threw away all the related words and applied dictionary lists of swear words and words of praise, we would be able to use less than 5% of the total data, and that assessing likes and dislikes from under 5% of the data is a very crude calculation.

By that point, I already sensed that this person was looking for the one-handed economist and had no interest whatsoever in data-based understanding, so I was eager to end the meeting quickly and wrap the situation up. I was quite shocked to hear afterwards from someone who was with me that he had previously been in charge of data analysis at a very prominent organization.

Perhaps what he had done for 10 years was deliver to his superiors a one-variable metric that compresses everything into a simple 'positive/negative' value. Maybe he understood that a positive/negative distinction based on dictionary words is a crude analysis, yet he still pressed me, in great frustration, to produce the same kind of conclusion. In the end I made him a simple pie chart using dictionary positive and negative words, but the fact that people doing this kind of one-variable analysis have been working as data experts at major organizations for 10 years seems to show the reality of the 'AI industry'. It was a painful experience. The world has changed a great deal in those 10 years, and I hope they can adapt to the changing times.


High accuracy with 'Yes/No' isn't always the best model


With high-variance data, a 0/1-matching model is close to meaningless; the same accuracy is hard to reproduce on a new set of data
What is known as 'interpretable AI' ultimately comes back to basic statistical models
'AI' = 'advanced' = 'accurate' is a misperception, if not a myth; a wrong model yields only wrong interpretations

This happened five years ago. A simple 'artificial intelligence' tutorial that used Boston-area housing data to predict house prices or monthly rents from information such as room size and number of rooms had recently spread through social media. A presentation group that had built a website around how far people had pushed that model asked me to give a talk on targeted advertising models using data science.

I was shocked for a moment to learn that such a low-level presentation group was heavily sponsored by a large, well-known company. Then they proudly showed me an SNS post in which someone had fed the data into various 'artificial intelligence' models and found that the best-fitting one was a 'deep learning' model, boasting that they had assembled a group of people with great skills.

Then as now, studies that merely feed textbook models into the various computational libraries Python provides and report which calculation fits best are treated not as research but as a simple code-running warm-up exercise, so I was more than a little shocked. Since then, I have seen papers of the same kind not only from engineering researchers but also from medical researchers, and even from researchers in media studies and sociology. It is one of the clearest signs of how shockingly many degree programs at Korean universities are run.

Just because it fits ‘yes/no’ data well doesn’t necessarily mean it’s a good model

For calculations that predict dichotomous outcomes classified as 'yes/no' or '0/1', what matters is not the model's accuracy on the given data but robustness verification: checking whether the model keeps fitting well, repeatedly, on similar data.

In machine learning, this robustness verification is done by separating 'test data' from 'training data'. The method is not wrong, but it has the limitation of applying only when the similarity of the data keeps repeating.

To make this easier to understand with an example: stock price data is the textbook case of data whose similarity breaks down. Take the past year of data, use January through June as training data, and pick the model that fits July through December best; it is still very hard to obtain the same level of accuracy on the following year, or on past data. Professional researchers joke about such meaningless calculations along the lines of, 'Getting 0% right would be only natural, and if the accuracy held at 0%, at least that would be consistent.' This should help convey how meaningless it is to hunt for a model that fits '0/1' well when the similarity of the data does not repeat continuously.
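A toy sketch of this point, under the assumption that a random walk is an acceptable stand-in for a stock price series: a classifier tuned on one half-year window shows no stable accuracy on later windows.

```python
# Fitting an up/down classifier on one window of a random walk and scoring it
# on later windows; synthetic data, for illustration only.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
prices = np.cumsum(rng.normal(size=730))               # two years of daily "prices"
returns = np.diff(prices)
X = np.array([returns[i:i + 5] for i in range(len(returns) - 5)])  # 5-day windows
y = (returns[5:] > 0).astype(int)                      # next day up (1) or down (0)

train, test, future = slice(0, 125), slice(125, 250), slice(365, 490)
model = LogisticRegression().fit(X[train], y[train])
print("fit Jan-Jun, score Jul-Dec:", model.score(X[test], y[test]))
print("same model, next year:     ", model.score(X[future], y[future]))
# Both scores hover around 0.5: the data's 'similarity' does not repeat, so a
# model that happened to fit one window says nothing about the next one.
```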

The indicator commonly used for data similarity is periodicity, which appears in the analysis of frequency data and the like; expressed in high-school mathematics, think of the 'sine' and 'cosine' functions. Unless the data repeats with similar periodicity, you should not expect a model to do well on new external data just because it distinguishes '0/1' well on this round of test data.

Such low-repeatability data is called 'high noise data' in data science. Instead of models such as deep learning, known as 'artificial intelligence', which incur enormous computational cost, an ordinary linear regression model is used to explain the relationships in the data. In particular, when the variance structure of the data follows a distribution well known to researchers, such as the normal, Poisson, or beta distribution, a linear regression or a similar formula-based model achieves high accuracy without paying that computational cost. This has been accepted as common sense in statistics since the 1930s, when the concept of regression analysis was established.

Be aware that high- and low-variance data call for different calculation methods

The reason many engineering researchers in Korea do not know this, and mistakenly believe that the 'advanced' calculation method called 'deep learning' always yields better conclusions, is that the data used in engineering is mostly frequency-type 'low noise data', so they never learn how to handle high-variance data during their degree programs.

Moreover, since machine learning models are specialized for identifying non-linear structures that appear repeatedly in low-variance data, the challenge of generalizing beyond '0/1' accuracy is simply dropped. Among the calculation methods in machine learning textbooks, none except 'logistic regression' can use the variance-based analysis that statisticians rely on for model verification, because the variance of the model cannot be computed in the first place. Academics express this by saying that '1st moment' models cannot be used for '2nd moment'-based verification; variance and covariance are the commonly known '2nd moments'.

Another big problem with such 'first moment'-based calculations is that they cannot give a reasonable account of the relationships between the variables.

Let's take an example.

$$\widehat{UGPA}_i = \underset{(0.33)}{1.39} + \underset{(0.094)}{0.412}\, HGPA_i + \underset{(0.011)}{0.15}\, SAT_i - \underset{(0.026)}{0.083}\, SK_i$$

The equation above is a simple regression created to determine how much college GPA (UGPA) depends on high school GPA (HGPA), CSAT scores (SAT), and attendance (SK); standard errors are in parentheses below each coefficient. Setting aside the problems of the individual variables and assuming the equation was estimated reasonably, we can see that high school GPA influences undergraduate GPA by as much as 41.2%, while CSAT scores influence it by only 15%.

Machine learning calculations based on the 'first moment' focus only on how well they match college GPA; to check how much influence each variable has, you sometimes need additional model transformations, and sometimes you must give up entirely. Even the 'second moment'-based statistical verification that could confirm the reliability of those numbers is impossible. Following the Student-t distribution testing learned in high school, we can see that the 41.2% and 15% in the model above are both reasonable figures, but machine learning calculations allow no comparable statistical verification.
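For contrast, the 'second moment' verification described here is nearly a one-liner once a model reports standard errors. The sketch below recomputes the t-statistics from the coefficients and standard errors printed under the equation; the degrees of freedom are an illustrative assumption, since the sample size is not given here.

```python
# t-statistics from the coefficient/standard-error pairs under the equation.
from scipy import stats

coefs = {"HGPA": (0.412, 0.094), "SAT": (0.15, 0.011), "SK": (-0.083, 0.026)}
df_resid = 137                          # illustrative degrees of freedom (n - k - 1)
t_crit = stats.t.ppf(0.975, df_resid)   # two-sided 5% critical value

for name, (b, se) in coefs.items():
    t = b / se
    print(f"{name}: t = {t:.2f}, significant at 5%: {abs(t) > t_crit}")
```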

Why the expression ‘interpretable artificial intelligence’ appears

You may have seen the expression 'interpretable artificial intelligence' appearing frequently in the media and in bookstores. The problem, which arises because machine learning models have the blind spot of delivering only 'first moment' values, is precisely that interpretation is impossible. As the example above shows, they cannot give answers as reliable as existing statistical methodology to questions such as how strong the relationship between variables is, whether that relationship value can be trusted, and whether it will appear similarly in new data.

Going back to the corporate-sponsored data group that built a website titled 'How far have you pushed the Boston house price data?': if even one of them had known that machine learning models carry the problems above, would they have confidently posted on social media that they tried several models and found 'deep learning' to fit best, and emailed me claiming to be experts because they could run code to that extent?

As we all know, real estate prices are strongly influenced by government policy, as well as by the surrounding educational environment and transportation accessibility. This is not just a Korean phenomenon; based on my experience living abroad, major overseas cities are not much different. If anything, a distinctly Korean feature is that the apartment's brand seems to be a somewhat more influential variable.

The size of the house, the number of rooms, and so on are meaningful only when other conditions are equal, and other key variables would include whether the windows face south, southeast, or southwest, and the building's layout. The Boston house price data circulating on the Internet at the time had lost all such core data; it was merely example data for checking whether your code runs.

If you use artificial intelligence, can't accuracy reach 99% or 100%?

$$\widehat{\log(rent)} = \underset{(.043)}{.844} + \underset{(.039)}{.066}\, \log(pop) + \underset{(.039)}{.507}\, \log(avginc) + \underset{(.0017)}{.0056}\, pctstu$$

$$n = 64, \quad R^2 = .458$$

Another expression I often heard was, "Even if you can't improve accuracy with statistics, can't artificial intelligence achieve 99% or 100% accuracy?" The 'artificial intelligence' the questioner had in mind was presumably 'deep learning' as generally known, or 'neural network' models of the same family.

First of all, the explanatory power of the simple regression above is 45.8%, as the R-squared value of .458 shows. The question amounted to asking whether some other 'complex', 'artificial intelligence' model could raise it to 99% or 100%. The data is a calculation of how strongly changes in monthly rent near a university relate to population change, change in household income, and change in the share of students. Once you know, as explained above, that real estate prices respond to countless variables, including government policy, education, and transportation, you can see that the only sure way to fit the model with 100% accuracy is to predict monthly rent with monthly rent itself. Plugging in X to find X is something anyone can do, isn't it?

Beyond that, it is common sense that the numerous variables affecting monthly rent cannot be matched perfectly by a simple equation, so no further explanation should be needed. The domain where one can even attempt 99% or 100% accuracy is not social science data but data that repeatedly produces standardized results in a laboratory: 'low noise data', in the terms used above. Typical examples are language data, where sentences must follow a grammar; image data, once bizarre pictures are excluded; and rule-bound games such as Go. Although it is only natural that the high-variance data we meet in daily life cannot be matched 99% or 100%, at one time the standing requirements of every government-commissioned artificial intelligence project were 'must use deep learning' and 'must demonstrate 100% accuracy.'

Returning to the equation above: the student population growth rate and the overall population growth rate have little effect on the rate of rent increase, whereas income growth has a very large effect of up to 50% (the coefficient .507). Moreover, testing the overall population growth rate with the Student-t distribution statistics learned in high school gives a statistic of only about 1.65, so the hypothesis that it does not differ from 0 cannot be rejected: it is a statistically insignificant variable. The student population growth rate does differ from 0, and so can be judged significant, but it raises the rent growth rate by only 0.56%.
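For reference, a regression of this form is straightforward to estimate with a formula interface. The sketch below runs on synthetic stand-in data, since the original 64-observation sample is not reproduced here; only the shape of the call is the point.

```python
# Estimating a log-log rent equation of the form shown above.
# Synthetic stand-in for the 64-city data; results will not match the article.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
df = pd.DataFrame({
    "pop": rng.uniform(2e4, 5e5, 64),      # city population
    "avginc": rng.uniform(2e4, 6e4, 64),   # average income
    "pctstu": rng.uniform(5, 60, 64),      # student share of population
})
df["rent"] = np.exp(0.844 + 0.066 * np.log(df["pop"])
                    + 0.507 * np.log(df["avginc"]) + 0.0056 * df["pctstu"]
                    + rng.normal(0, 0.1, 64))

fit = smf.ols("np.log(rent) ~ np.log(pop) + np.log(avginc) + pctstu", data=df).fit()
print(fit.params)     # compare with the coefficients printed above
print(fit.rsquared)   # explanatory power; .458 in the article's sample
```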

Interpretation of that kind is, in principle, impossible with the 'artificial intelligence' calculations known as 'deep learning', and producing a comparable analysis requires enormous computational cost and advanced data science research methods. Nor does paying that cost greatly raise the explanatory power that stood at 45.8%: the data has already been converted to logarithms so that only rates of change matter, which means the non-linear relationships in the data are already internalized in the simple regression model.

Because of a misunderstanding of the model known as 'deep learning', industry has made the shameful mistake of paying a very high learning cost and pouring manpower and resources into the wrong research. I hope the simple regression-based example above helps readers recognize the limitations of the calculation method known as 'artificial intelligence' and avoid repeating the mistakes researchers have made over the past six years.
