Estimation of economic indicator from big blog data

We introduce a systematic method for estimating an economic indicator published by the Japanese government from a large corpus of Japanese blog data. The method consists of four parts: (i) variable selection, (ii) grouping, (iii) round robin, and (iv) regression analysis.
In this study, the explained variable is the index of business conditions, specifically the Composite Index (CI), which is announced monthly by the Cabinet Office of Japan and describes the condition of the Japanese economy quantitatively. The explanatory variables are monthly word frequencies. As candidate words for explaining the index, we adopt the 1352 words in the economics and industry section of the Nikkei thesaurus.
(i) Variable selection: by applying Fisher’s exact test, we select the words whose patterns and trends are similar to those of the CI, which yields 416 words that are strongly correlated with the CI. Fig. 1 shows a typical example, the word 不景気 (recession), for which a clear negative correlation can be observed.
(ii) Grouping: to avoid multicollinearity and overfitting in the regression analysis, we group words with similar patterns and trends by applying Fisher’s exact test to every pair among the 416 words selected in (i).
(iii) Round robin (detection of spurious correlation): we introduce a new method, based on Bayesian inference, to detect spurious correlations between the frequencies of two words.
As a result of (i), (ii) and (iii), we obtain 17 representative words, such as 不景気 (recession) and 外国銀行 (foreign banks), no pair of which shows either a strong correlation or a spurious correlation in word frequency.
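The abstract does not state how the time series are turned into the contingency table that Fisher’s exact test needs. The following is a minimal sketch under one possible assumption: each month is classified by whether the word frequency and the index rose or fell relative to the previous month, and the resulting 2x2 table is tested. The function names and the toy series are hypothetical and are not taken from the study. The same test could be applied to a pair of word-frequency series for the grouping step (ii).

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d
    denom = comb(n, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables at least as unlikely as the observed one.
    total = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))
    return min(1.0, total)


def direction_table(series_x, series_y):
    """Count months by the joint direction of change (ASSUMED binarization)."""
    a = b = c = d = 0
    for t in range(1, len(series_y)):
        x_up = series_x[t] > series_x[t - 1]
        y_up = series_y[t] > series_y[t - 1]
        if x_up and y_up:
            a += 1
        elif x_up:
            b += 1
        elif y_up:
            c += 1
        else:
            d += 1
    return a, b, c, d


# Toy data (hypothetical): a word whose frequency moves against the index.
toy_index = [100, 101, 102, 101, 100, 99, 100, 101, 102, 103, 102, 101]
toy_word = [50 - (v - 100) for v in toy_index]
toy_p = fisher_exact_two_sided(*direction_table(toy_word, toy_index))
print(f"p-value for the toy word: {toy_p:.4f}")
```

A small p-value here only indicates that the two series tend to move together or in opposition more often than chance would suggest; the threshold used to arrive at 416 words is not given in the abstract.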
(iv) Regression analysis: we build a model that reconstructs the CI by regression on word frequencies. To decide how many words are needed as explanatory variables, we propose a hypothesis test that compares the word frequencies with artificial time series following a random walk model, and we find that using 7 words in the regression is the most reasonable choice. Our model reproduces the real economic index CI very well.
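The abstract does not spell out the hypothesis test against random walks, so the following is only a simplified, single-variable sketch of the general idea: the fit of the index on a word-frequency series is compared with the fits obtained when that series is replaced by simulated random walks. The function names, the Gaussian step distribution, and the number of simulations are assumptions made for illustration, and the actual study works with several words at once.

```python
import random


def r_squared(x, y):
    """Coefficient of determination of a simple linear regression of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy * sxy / (sxx * syy)


def random_walk(n, rng):
    """Artificial series: cumulative sum of Gaussian steps (ASSUMED step law)."""
    pos, walk = 0.0, []
    for _ in range(n):
        pos += rng.gauss(0.0, 1.0)
        walk.append(pos)
    return walk


def random_walk_p_value(word_freq, index, n_sim=2000, seed=0):
    """Empirical p-value: share of random walks fitting the index at least as well."""
    rng = random.Random(seed)
    observed = r_squared(word_freq, index)
    hits = sum(
        r_squared(random_walk(len(index), rng), index) >= observed
        for _ in range(n_sim)
    )
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_sim + 1)


# Toy data (hypothetical): a noisy word series that tracks the index inversely.
toy_rng = random.Random(1)
toy_index = [100 + 0.5 * t for t in range(36)]
toy_word = [-2.0 * v + toy_rng.gauss(0.0, 1.0) for v in toy_index]
print(random_walk_p_value(toy_word, toy_index))
```

Random walks are a demanding null model because two unrelated trending series can fit each other well by chance, so a word is kept as an explanatory variable in this sketch only if it explains the index better than most such artificial series do.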
The announcement of an economic index by the government usually comes with a time lag, whereas our method is prompt. Moreover, the method applies not only to estimating economic indicators but to any situation in which good explanatory variables must be found for an explained variable, which is one of the most important tasks in big data analysis.

Session:

Information and Communication Technologies

Authors:

Kenta Yamada, Hideki Takayasu and Misako Takayasu

Room:

Date:

Tuesday, September 25, 2018 - 11:45 to 12:00
