Monday, March 19, 2012

Amount of characters and words necessary to read news articles

Abstract

Hello everyone, and welcome back to my never-ending study. In the last two posts I tried to count the number of unique Chinese characters and words in Taiwanese news by analyzing 80 news articles collected over a period of six weeks. I found a total of 2105 unique characters and 5901 unique words in those 80 articles, which came from four sections: 國際 (international), 政治 (domestic politics), 社會 (society) and 財經 (economics). As I said before, though, 80 articles were not enough, so I tried to extend the study. Using the sampled data, I did some calculations and tried to predict how many unique characters and words there would be in any given number of articles. I found that there would be a total of 2174 unique characters and 8424 unique words, so a person would need to know this many characters and words to recognize 100% of any given number of news articles, provided the articles came from the same 4 news sections I analyzed.

Introduction

The main task was to predict how the unique character and word charts would evolve and at what point on the y-axis they would stop ascending. The x-axis value corresponding to that point would be the total number of characters a person needs to know in order to recognize 100% of a random news article, as long as it comes from one of the 4 sampled news sections. As you can see in the following two charts, both have ascending trends, and the Word knowledge chart is still rising sharply at its end, seemingly without approaching any particular number.





I therefore looked at the source data again and, with the help of OpenOffice Calc, tried to calculate at what number the two charts would 'stall' (there is a mathematical expression for this; I only vaguely remember the Slovak term, 'asymptota', which is 'asymptote' in English) and what the total number of unique Mandarin characters and words in any given number of Mandarin articles would be.

Research method

Since my research method was the same in both the Character and the Word prediction analyses, I will only describe the former. I noticed that in the last 50% of the character chart there were small fluctuations that could help determine the trend of its development.





As you can see, these fluctuations are too small to make a reasonable prediction possible. I had to come up with a way to magnify them, and to do that I had to choose a completely different approach, which turned out to be pretty complicated.

The first two charts in this article only show that by knowing, for instance, the 800 most frequent characters, one will be able to recognize about 90% of the 80 analyzed articles. To calculate the trend for an arbitrary number of articles, and since I could only work with the data I had at hand, I had to turn the whole study around: calculate how many unique characters there would have been had I only analyzed 40, 45 or 50 articles, and try to make a reasonable prediction based on that.
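The kind of coverage figure quoted above (the N most frequent characters recognizing some share of the text) can be sketched in Python as follows. This is only a minimal illustration: the function name `coverage` and the toy sample are assumptions made here, not the program used in the study.

```python
from collections import Counter


def coverage(tokens, top_n):
    """Share of all token occurrences covered by the top_n most frequent tokens."""
    counts = Counter(tokens)
    covered = sum(count for _, count in counts.most_common(top_n))
    return covered / len(tokens)


# Toy sample (hypothetical): 10 character occurrences, 4 unique characters.
sample = list("aaaabbbccd")
for n in range(1, 5):
    print(n, coverage(sample, n))
```

With real data, `tokens` would be the list of all characters (or words) in the sampled articles, and the printed pairs would give the points of a knowledge vs. recognition chart.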

Since I had the data already processed and all in one file, I no longer knew where one article ended and the next one started. And since the articles were of different sizes, I chose an averaged calculation (based on percentages of the total data) over an article-by-article one, since I thought it would be more precise.

What I basically did was take all the sampled data (80 articles) that I had used for the word analysis (the text file where I manually put each word on a new line) and put it into the first column in OpenOffice Calc (the Microsoft Excel equivalent). I then took exactly 50% of that amount and ran my friend's program to tell me how many unique characters were in that 50% of the data. In the next column I put 55% of the total amount of data, ran the program again and got the number of unique characters for 55% of the data. In the third column I put 60% of the data and ran the program again, and so on until I got to 100%. What I had now was the number of unique characters at each step across the last 50% of the data from the original 80 articles, in 5% increments.
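The counting step described above could look roughly like this in Python. It is a sketch under assumptions: the input is taken to be a list of words in their original order (one per line of the text file), and `unique_characters_in_prefix` is a hypothetical stand-in for the program actually used, whose internals are not described here.

```python
def unique_characters_in_prefix(words, percent):
    """Count the unique characters in the first `percent` % of the word list."""
    cutoff = len(words) * percent // 100  # integer arithmetic avoids float rounding
    return len({ch for word in words[:cutoff] for ch in word})


def unique_counts(words, start=50, stop=100, step=5):
    """Unique character counts at start%, start+step%, ..., stop% of the data."""
    return {p: unique_characters_in_prefix(words, p)
            for p in range(start, stop + 1, step)}


# Toy data (hypothetical); the real input would be the full word list.
toy_words = ["ab", "bc", "cd", "de"]
print(unique_counts(toy_words, start=50, stop=100, step=25))
```

Because each prefix contains the previous one, the counts can only grow or stay the same as the percentage increases, which is what the table of results shows.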

Percent of data 50% 55% 60% 65% 70% 75% 80% 85% 90% 95% 100%
Unique characters 1750 1813 1850 1892 1928 1966 2000 2013 2058 2077 2105



I now looked at the increases in these unique occurrences with the growing amount of data and tried to figure out how to calculate the trend past the 100% mark. As you can see, the number of unique characters was increasing by amounts that were in general getting smaller and smaller with each additional 5%. To my great delight, OpenOffice Calc has a function that can calculate trends for you, and although it took me a while to figure out exactly how to do it, I managed to produce the following chart.



I assigned the value of '1' to 55% (since the value of 0 represented 50%), '2' to 60%, '3' to 65%, etc., with the value of '10' representing 100%. I plotted these values on the x-axis. On the y-axis I plotted the incremental number of unique characters corresponding to the increasing amount of data. This means, for instance, that at value 1, which is 55% of the data, there were 63 new unique characters compared to the previous 50% of the data.

Then I simply continued past the value of 10, which was equal to 100% of the original data, and let OpenOffice Calc calculate the trend for me. Everything on the chart past the value of 10 on the x-axis is the trend calculated by OpenOffice Calc. As you can see, the graph hits zero at a value slightly over 17 (17.24 to be precise), which is equal to 141.2%. This means that from this point on no more new unique characters would occur, no matter how much more data was added.
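For readers who want to retrace the trend step, here is a minimal Python sketch. It assumes the OpenOffice Calc trend was an ordinary least-squares straight line through the ten increments; that is an assumption, although a straight line fitted this way to the table above does cross zero at about 17.24. The helper name `fit_line` is made up for this example.

```python
# Unique character counts at 50%, 55%, ..., 100% of the data (from the table above).
unique = [1750, 1813, 1850, 1892, 1928, 1966, 2000, 2013, 2058, 2077, 2105]

# New unique characters gained with each additional 5% of data (x = 1 .. 10).
increments = [b - a for a, b in zip(unique, unique[1:])]
xs = list(range(1, len(increments) + 1))


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line(xs, increments)
zero_crossing = -intercept / slope  # x value where the fitted increment reaches 0

print(increments)
print(round(zero_crossing, 2))
print(round(slope * 11 + intercept, 2))  # predicted increment one step past x = 10
```

A straight line is the simplest possible trend; a curve that flattens out more slowly would push the zero point further out, so the result depends heavily on this choice.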

Analysis results 

Percent of data 100% 105% 110% 115% 120% 125% 130% 140% 141%
Incremental unique characters 0 18.87 15.84 12.82 9.79 6.77 3.75 0.72 0



The number 18.87 in the third column means, for instance, that had I sampled 105% of the data instead of the original 100%, there would have been an additional 18.87 unique characters in it. All that was left to do was to take the total number of unique characters I found in 100% of the original articles (2105) and add the sum of the predicted unique character increases. The total predicted number I arrived at was 2174. This means that, according to my data and trend calculations, if you were to continue reading only these 4 news sections from Taiwan, you would not find more than 2174 unique characters in them, no matter how many articles you read. I can thus say that, based on my data and trend calculations, one needs to know 2174 characters in order to be able to recognize 100% of any given number of news articles, provided that they are found in the 4 news sections I analyzed.
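The addition described above is easy to check. The short Python sketch below just sums the predicted increments from the table and adds them to the observed count; the variable names are chosen for this example only.

```python
observed_unique = 2105  # unique characters in 100% of the sampled data

# Predicted increments past 100%, taken from the table above.
predicted_increments = [18.87, 15.84, 12.82, 9.79, 6.77, 3.75, 0.72]

predicted_total = observed_unique + sum(predicted_increments)
print(round(sum(predicted_increments), 2))  # total predicted new characters
print(round(predicted_total))               # estimated ceiling, rounded to a whole character
```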

Just for reference, the total number of characters at which the predicted graph hit zero, and thus the amount of text after which no more new unique characters occurred, was 53 776 (141.2% of the original total number of characters in the 80 sampled articles), which roughly corresponds to 113 news articles. This would mean that after reading 113 news articles in the 4 sections I analyzed, one would not come across any new characters.
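The conversion from a percentage of the data to a number of characters and articles is a simple scaling, sketched below in Python. It assumes the average article length stays the same as in the sample; the total of 38085 characters is the figure from the results table at the end of this post.

```python
total_characters = 38085  # characters in the 80 sampled articles
sampled_articles = 80
saturation_ratio = 1.412  # 141.2% of the original data

characters_at_saturation = total_characters * saturation_ratio
articles_at_saturation = sampled_articles * saturation_ratio

print(round(characters_at_saturation))
print(round(articles_at_saturation))
```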

Word prediction chart

I did the same trend calculations for the unique word occurrence chart as I did for the character prediction chart, but as I mentioned before, because of the 5% error margin, the number I came up with is really just a very rough estimate. Plus, it is evident from the Word knowledge vs. Text recognition chart at the beginning of this article that the chart was still in sharp ascent at its end and could develop in a lot of unpredictable ways, so the following chart is really only a very rough estimate.


After doing the calculations and plotting the data on the chart, I found that there would be a total of 8424 unique words that one would need to know in order to recognize all words in any given number of news articles in the 4 sections of Taiwanese news I analyzed. The predicted trend chart hit zero at the value of 36.54, which corresponds to 235.69% of the original sampled data, or 49 705 words. These would be found in 189 articles, which means that, according to my trend predictions, you would not encounter any new words after having read 189 articles.

Conclusion

Below is a table with the end results of the entire study. Even though the number of articles I sampled was really small and the predictions I made were only estimates, in my opinion the results were quite precise, at least in the case of the unique character calculations. To my great surprise, it really seems that one does not need to know more than 2174 unique characters in order to recognize 100% of any number of news articles, as long as they are found in one of the 4 news sections I analyzed.





                         Characters  Words
Original articles        80          80
Total amount             38085       21089
Unique                   2105        5901
Unique estimated         2174        8424
Estimated at             53776       49705
Estimated at (articles)  113         189



My guess is that the reason for such a small number of unique characters necessary to read the news is that a great number of these characters are absent from or rarely used in books, while books in turn contain characters that are absent from or rarely used in news articles. Another interesting conclusion is that you would need to know almost 4 times as many unique words as unique characters, which again shows that the number of unique words one knows is more important than the number of unique characters.

Finally, since this was only a study of text recognition, in the future I would also like to do a study on text understanding and come up with the number of unique characters and words one would need to know in order to understand news articles, books or MSN chats. If everything goes well, I would like to do similar analyses of book texts and real-life speech.

5 comments:

  1. Excellent research there, thanks for sharing this data with us! I've come up with similar results for lexical coverage in Russian and Arabic news corpora in the past too, whilst I've found that the numbers for spoken dialogue in English movies are even more encouraging (around the 95% coverage mark with the 3,000 most frequent word families and additional proper nouns/loanwords). :)

    1. Hi Teango. I was planning to sample movies too, but I'm not sure how to do it yet. How did you sample them? Did you take the subtitle files and run some sort of a computer program on them?

      Vladimir

  2. Так точно! (Exactly!) The most frequent vocabulary in movies turns out to be considerably different from what you'd usually find in literary sources, as well as less varied than you'd expect, and this is great, as it offers a much clearer indication of what is really needed on an immersion trip abroad.

  3. Your studies are very interesting. Many thanks for sharing.
    Eduardo Danezi G
