Forever a student: Chinese characters


Thursday, August 21, 2014

Derived characters

While I was still at the Chinese department, during our lectures on Chinese writing, our professors taught us about 6 Chinese character types: pictograms (象形字), simple indicatives (指事字), semantic compounds (會意字), phono-semantic compounds (形聲字), phonetic loans (假借字) and derived characters (轉注字). (For further reading on Chinese character types see this post).

While they explained the first five in considerable detail, when it came to the sixth category we were told that it still requires further research and that no one really understands it well. Or so they said.

望 wang4 'hope, expect' is a derived character. Let's look at its definition from the 說文解字 (100 CE) dictionary first:

望：出亡在外，望其還也。从亡，朢省聲。

It took me quite some time to figure out what this means. 出亡在外，望其還也 is the definition itself. After quite a bit of research on all the characters in the definition, I think the translation should be: 望: To run away (from home) and disappear. Looking into the distance for the disappeared one. The 从亡，朢省聲 part defines the elements in the character. It says: formed by the 亡 wang2 'perish, disappear' semantic and a reduced 朢 wang4 'full moon' phonetic.

If you look closely at the 朢 character, the 臣 chen2 'subject, servant' element at the top left has been replaced by 亡 to form 望.

The theory goes that a derived character is formed by taking an existing character that has the sound needed for the new character, removing one of its elements, and replacing that element with another one. What remains of the original character acts as the phonetic element, and the newly inserted element acts as the semantic element.

Some other examples:

毀 hui3 'destroy, ruin' is formed by the 土 tu3 'earth, soil' semantic and a reduced 毇 hui3 'beat grain' phonetic. The 米 mi3 'rice' element has been extracted from 毇 and replaced by 土. The residual combination of 臼 and 殳 is a non-existent character.

浸 jin4 'soak' is formed by the 氵(水) shui3 'water' semantic and a reduced 侵 qin1 'invade' phonetic. 侵 qin1 'invade' originally meant 'to proceed' and was formed by the 亻(人) ren2 'person' semantic and the 帚 zhou3 'broom' semantic (today written with 又 instead of 巾 at the bottom). It was thus a semantic compound (a person sweeping the floor with a broom, 'proceeding' in a certain direction), not a phono-semantic compound. In 浸, the 亻(人) ren2 'person' semantic has been replaced by the 氵(水) shui3 'water' semantic, but 帚 zhou3 'broom' has nothing to do with 浸 either phonetically or semantically; it is just a residue of 侵 after 亻(人) ren2 'person' has been removed.

畿 ji1 'territory around the capital' is formed by the 田 tian2 'field' semantic and a reduced 幾 ji3 'several, few, how many?' phonetic. The 人 ren2 'person' element at the bottom left has been extracted from 幾 and replaced by 田. This character is even messier, because 幾 was originally written as a combination of two 幺 yao1 'small' over 戍 shu4 'patrolling soldier', with the meaning 'dangerous' (small patrolling soldier or few patrolling soldiers, hence danger, dangerous). In the 幾 character it just so happens that the scholars responsible for fixing its written form created a character in which the bottom left part of 戍 resembles 人 ren2 'person', which was then replaced by 田 to form 畿.

Another interesting example is 遊 you2 'wander, walk around'. To understand its etymology we have to go two steps back. The base character for 遊 you2 is 汓 qiu2 'float, hover, drift'. Later a phono-semantic compound 游 you2 'swim, float; walk around, travel' was created with the original meaning 'movement of a flag in the wind' (the 㫃 yan3 'flag' semantic jammed into the 汓 qiu2 'float' phonetic to create 游). Finally 遊 you2 'wander, walk around' was created by removing 氵(水) shui3 'water' and replacing it with the 辶 (辵) chuo4 'go, walk' semantic. If you remove 辶 from 遊, what is left is a non-existent character with no meaning or sound, only a residue of 游 without 氵.

The key difference between these derived characters and phono-semantic compounds, where one part of the character represents the sound and another part its meaning, is this: a phono-semantic compound can be cleanly separated into two full standalone characters, whereas a derived character cannot, or if it can, the standalone characters will not represent their original phonetic values, as was the case with 浸 jin4 'soak'. You can separate 清 qing1 'clear' into the 氵(水) shui3 'water' semantic and the 青 qing1 'green, blue' phonetic, or 情 qing2 'emotion' into the 忄(心) xin1 'heart, mind' semantic and the 青 qing1 'green, blue' phonetic, but you can't separate 望 that way. 亡 wang2 'perish' is the semantic element in 望, but the rest is a non-existent character; in other words, 月 over 壬 doesn't exist and means nothing. It is just a leftover of the original 朢 character after 臣 has been removed.
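To make the pattern easier to compare across the examples in this post, here is a minimal Python sketch that records each derived character as "source character minus one element plus a new semantic element". The class name `DerivedCharacter` and its field names are my own labels for illustration, not established terminology; the data is taken only from the examples above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DerivedCharacter:
    char: str      # the derived character itself
    semantic: str  # element inserted to carry the meaning
    source: str    # existing character borrowed for its sound
    removed: str   # element taken out of the source to make room


# The five examples discussed in this post.
EXAMPLES = [
    DerivedCharacter("望", "亡", "朢", "臣"),
    DerivedCharacter("毀", "土", "毇", "米"),
    DerivedCharacter("浸", "氵", "侵", "亻"),
    DerivedCharacter("畿", "田", "幾", "人"),
    DerivedCharacter("遊", "辶", "游", "氵"),
]


def describe(entry: DerivedCharacter) -> str:
    """Spell out the substitution as a small 'equation'."""
    return f"{entry.char} = {entry.source} - {entry.removed} + {entry.semantic}"


for entry in EXAMPLES:
    print(describe(entry))
```

A phono-semantic compound such as 清 would not fit this record at all, because nothing is removed: it is simply 氵 plus the complete character 青.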

Sunday, November 17, 2013

Chinese character etymology and Chinese character phonetic series

This is the first lecture in what will hopefully be a longer series on Chinese character etymology and Chinese character phonetic series. In this lecture I try to explain what phono-semantic compound characters (形聲字) are, and I go through the 才 phonetic series and the etymology of all the characters in it.

Characters in this video:

才 cai2 - talent, material. Leading phonetic character of the group.

財 cai2 - money, wealth
材 cai2 - material
在 zai4 - to be located at
載 zai4 - to give someone a ride
裁 cai2 - to cut
戴 dai4 - to wear (clothes), to put on
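As a rough illustration of how well 才 works as a phonetic in each of these characters, the following Python sketch splits the numbered pinyin readings listed above into initial, final and tone and compares each part with cai2. The helper names `split_pinyin` and `compare` are made up for this example, and the simple initial list is only meant to cover ordinary Mandarin syllables.

```python
# Pinyin initials; the two-letter ones come first so they are matched before z, c, s.
INITIALS = ("zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
            "g", "k", "h", "j", "q", "x", "r", "z", "c", "s", "y", "w")


def split_pinyin(numbered):
    """Split numbered pinyin such as 'cai2' into (initial, final, tone)."""
    syllable, tone = numbered[:-1], int(numbered[-1])
    for initial in INITIALS:
        if syllable.startswith(initial):
            return initial, syllable[len(initial):], tone
    return "", syllable, tone  # syllable without an initial, e.g. 'an1'


def compare(phonetic, reading):
    """Report which parts of a reading match the phonetic's reading."""
    p_initial, p_final, p_tone = split_pinyin(phonetic)
    r_initial, r_final, r_tone = split_pinyin(reading)
    return {
        "initial": p_initial == r_initial,
        "final": p_final == r_final,
        "tone": p_tone == r_tone,
    }


SERIES = {"財": "cai2", "材": "cai2", "在": "zai4",
          "載": "zai4", "裁": "cai2", "戴": "dai4"}

for char, reading in SERIES.items():
    print(char, reading, compare("cai2", reading))
```

In this list the final -ai is shared by every character, while the initial and the tone drift, which is why 才 is a perfect phonetic in 財 and 材 but only an approximate one in 在 or 戴.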

Monday, September 23, 2013

Understanding Chinese characters

Introduction

Chinese characters are a very complex system of recording the Chinese language in writing. Most of what seems to be a mix of illegible symbols is part of a logical but complex writing system that developed gradually, taking shape roughly 2300 - 3000 years ago, with the oldest confirmed characters dating back to around 1200 - 1050 BC. In this article I will try to briefly explain what you need to know in order to understand Chinese characters and what you should know before you start studying them.

Some basic facts:

The earliest confirmed evidence of the Chinese script yet discovered is the body of inscriptions on oracle bones from the late Shang dynasty (1200-1050 BC) - Wikipedia
According to some studies (including my own), you need to know only about 2500 characters to read the newspaper.
About 80% of the characters are made up of two elements - one responsible for the sound and one for the meaning of the character. This means that there is something in the character that will tell you how to read it and something else that will tell you what it means. 80% is a huge number and if you learn how to read this type of characters and understand their system, your learning progress will be much faster.

Benchmarks in character evolution

Oracle bone script

The earliest preserved characters that can be reasonably proven and dated are the inscriptions on oracle bones. These were bones (usually scapulae) of large animals (usually oxen) or turtle shells that were used in ancient Chinese fortune telling. A small hollow was drilled into the bone (probably after the animal had been sacrificed) and a glowing piece of coal was placed in it. The person responsible for the ritual then blew on the piece of coal, which cracked the bone, and based on the direction of the crack, the answer to the fortune teller's question was 'yes' or 'no'. The question and the result of the fortune telling, along with other details, were then inscribed in characters onto the bone itself. The character system that was used is called 甲骨文 - the Oracle bone script.

The discovery of these bones is relatively recent: the inscriptions were first recognized in 1899, and systematic excavations began in 1928. Fragments of these bones were sold in a Chinese medicine shop in a province in China, until someone noticed that they had these inscriptions on them. According to Wikipedia, they have been traced to a village near Anyang in Henan province.

Seal script

Before the unification of China in 221 BC, there were no universal rules for writing characters. There were several ways of writing the same character, with varying shapes, stroke orders and stroke types. Several local writing systems had developed as well. After China was united by the Qin dynasty in 221 BC and the Warring States period ended, the First Emperor Qin Shi Huang decided to abolish all existing forms of writing and ruled that the only form of writing to be used was the one used in the state of Qin (developed gradually during the Warring States period) - a script called the Seal script today.

Regular script

The Seal script preserved its official status for a relatively short period of time. Other scripts started to emerge, with some of them rising to dominance.

Regular script (the way characters are written today - both traditional and simplified) has been attributed to Zhong Yao, of the Eastern Han to Cao Wei period (ca. 151–230 AD), who has been called the “father of regular script”. However, some scholars postulate that one person alone could not have developed a new script which was universally adopted, but could only have been a contributor to its gradual formation. It was not until the Southern and Northern Dynasties that regular script rose to dominant status. During that period, regular script continued evolving stylistically, reaching full maturity in the early Tang Dynasty. - Wikipedia

Transitions

These three scripts are the benchmarks in the evolution of Chinese characters because, for the most part, the later ones derive directly from the earlier ones, each of them rose to prominence for an extended period of time, and respected dictionaries often refer to at least the Seal script versions for a better understanding of character etymology.

The above picture shows three versions of the character 人 ren2 'person' in all three scripts. In this particular case, the character is simple and has not undergone a lot of change. The changes are more formal than structural.

The next picture shows the character 化 hua4 'change' in all three versions. As it is a simple character, you still can't see any big structural changes between the Oracle bone and the Seal scripts, but formal changes were made in the transition from the Seal to the Regular script.

As you can see, both sides of the Regular script character have been changed. The left 人 has been contracted to 亻, which is a rule in the Regular script. Many standalone characters are contracted in some way when they serve as parts of other characters, mostly on the left side, and this is one example of it.

The original character was a picture of a person 人 and another person turned upside down, hence the meaning 'change'. Since the character did not significantly change in form in its transition into the Seal script, its etymology can be easily understood there. In the Regular script however, this is not the case. The Seal script is therefore a very important step in understanding character etymology, since in many cases it preserves the shapes of the Oracle bone script better than the Regular script.

Another example is the character 伐 fa2 'attack, to send an expedition' (formed by 人 ren2 'person' and 戈 ge1 'weapon') in all three scripts. Notice how 人 preserved its shape in the Oracle bone and Seal scripts but again has been arbitrarily changed in the Regular script.

The transitions in the above-mentioned characters are all simple examples, but they are very frequent. There are more complicated ones, however. The following example is one of them and also shows how much the Seal script in particular helps us understand character etymology:

The picture shows the character 乏 fa2 'to lack', which is the mirror image of the character 正 zheng4 'correct, precise'. The inversion was done on purpose to point to the meaning of the character and can be clearly seen in the Seal script, but it is completely lost in the Regular script. In the Regular script, the character consists of 丿 pie3 'left falling stroke, contracted' at the top and 之 zhi1 'to go' (which has many other meanings as well), neither of which has anything to do with the meaning or the sound of the character as a whole.

These two elements were chosen arbitrarily by the scribes in the transition from the Seal script to the Regular script, and this sort of abbreviation is a frequent feature of the whole process. The Seal script is a simplification of the Oracle bone script and the Regular script is a simplification of the Seal script (and modern Simplified characters further simplify the traditional characters of the Regular script). Since the scribes only had a limited set (hundreds) of elements to choose from for the transition and they had to choose the elements that most resembled the shape of the Seal script form, in the case of 乏 they ended up choosing 丿 and 之.

The seal script is also very helpful in understanding phono-semantic compound characters as the following example shows:

The top row shows the 父 fu4 'father' character as written in the Seal and Regular scripts. The second row shows the character 布 bu4 'cloth'. The character is composed of 巾 jin1 'towel' (semantic element) and 父 fu4 (phonetic element). In the Seal script, you can clearly see that 父 is part of the 布 character and acts as the phonetic element in it. In the Regular script it has been simplified into two strokes and is no longer recognizable.

80 %

As mentioned before, in about 80% of the characters in use today, one part of the character tells you how to read it and another part tells you what the character means (as is the case with the above-mentioned example of 父 and 布, for instance). These are called phono-semantic compounds (PSC). 80% is a huge number and it is safe to say that Chinese characters today can be divided into these compound characters and the rest.

When the first characters originated, they were simple pictures of objects, some of which (very few compared to the total number) are still in use today. Some of these characters are 人 (person), 龜 (turtle), 日 (sun), 月 (moon), 門 (door). Whoever was inventing these characters must have realized very soon that this way of recording a language was very impractical because:

there was no relation to the sound in the character and unless told, no one was exactly sure how to read it
it might have been easy to create small pictures of concrete objects, but abstract terms, verbs, adverbs, prepositions etc. must have been very difficult if not impossible to create.
apart from the fact that there is no relation to the sound or the way a picture should be read, there is also no clear relation to the meaning. A picture of a standing man can represent 'a person, a man, to stand, to be patient...' and probably lots of other things.
characters did not have a standard form, stroke order or stroke number. Quite possibly every time someone tried to write something and did not have an existing text at hand to compare it to, the shape, stroke number and order of some characters must have changed by accident. Some characters had almost 20 versions.
those who were inventing characters started to realize that it would be impossible to create as many characters as there are words, objects, actions, situations etc. and that some sort of combination would be necessary.

To partly overcome the problem of expressing abstract ideas, the scribes started to combine the meanings of existing characters into new ones (for instance 女 nv3 'woman' and 子 zi3 'child' were combined into 好 hao3 'good') or started to employ character loans (我 wo3 - originally a character meaning 'axe, weapon' composed of 扌shou3 'hand' and 戈 ge1 'axe', used for the 1st personal pronoun 'I, me' because the Ancient Chinese words for 'axe' and 'I, me' had the same or similar pronunciation). To overcome the ambiguity of multiple meanings, they started adding indicators to existing characters, pointing to their meanings (木 mu4 'wood' 本 ben3 'roots'; 刀 dao1 'knife' 刃 ren4 'edge of a blade'; 日 ri4 'sun' 旦 dan4 'dawn').

This, however, still did not solve the problem of pronunciation to any large extent, and the problem of comprehension also remained. Probably after sound loans had been introduced, instead of purely combining the meanings of two characters, the scribes started to combine them in such a way that, of the two or more characters chosen for combination, one pointed to the meaning and another pointed to the sound of the character as a whole. This method proved to be historically the most effective and prevalent one, as today more than 80% of characters in use are of this type. In the 康熙字典 - a huge and respected dictionary commissioned by the Kangxi Emperor in 1710 and published in 1716 - more than 90% of all characters are phono-semantic compounds.

Phono-semantic compounds explained

The above table shows the 才 character entry from the Etymological phonetic dictionary that I'm working on. 才 cai2 is the leading phonetic character for this group, only the semantic elements change. 才 is a very good character to explain PSCs on because it is both a regular and an irregular compound.

The most prevalent form of PSC today is one where the semantic element is on the left side and the phonetic element is on the right side, as is the case with the first two characters 財 and 材, which can be called regular PSCs. I call them regular simply because they are the most frequent ones. In this case 才 is actually a perfect phonetic, as it matches both the syllable (initial and final) and the tone.

In 在 zai4, however, the 才 phonetic element is on the left side and it has been corrupted (though it is still clearly visible in the Seal script version of the character). I call this an irregular PSC. 才 is also not a perfect phonetic element in this case (cai2 vs. zai4) but still works very well compared to some other PSCs.

The 存 cun2 character is not a PSC, but a meaning-meaning compound. 才 is clearly a part of it as a co-semantic element on the left (see explanation).

才 is the phonetic again in the following character zai2. It has been corrupted into a 十 at the top left. This character is not used as a standalone character today, but has been chosen as a new phonetic element in the following three characters and as a co-signific in the last one.

Conclusion

For understanding character etymology, understanding the earlier versions of modern characters, especially the Seal script is very helpful. Many phonetic or semantic elements have been simplified or corrupted and are not recognizable anymore in the modern versions.
You do not need to know 50 000 characters to read the newspaper or books. According to studies, 2500 characters is enough to read the newspaper. According to Wikipedia, the Dictionary of the Emperor KangXi contains 47 000 characters, but 40% of these are graphic variants. I would assume that most of the remaining characters are place names, people's names or names of local dishes, animals, plants or rarely used objects.
Most Chinese characters (about 80%) are phono-semantic compounds, where one element in the character points to the sound and another element points to the meaning of the character as a whole. Learning the system behind this type of character will speed up your learning significantly.
One of the main problems while studying characters is thus to learn the so-called leading phonetic characters for each phonetic group (such as 才 in this article), as they are usually meaning-meaning compounds or simplified pictures with no indication as to how they should be pronounced.

Thursday, November 15, 2012

New Youtube channel

Hello everyone,

I have launched a new Youtube channel as a supplement to my blog, where I would like to share some ideas about language learning. I'm currently working on the How to write Chinese characters playlist, in which you can find videos explaining in detail how to write Chinese characters. In each video I show how to write the characters, explain which writing rules apply to them and what details to look out for in order to write them correctly, and give a little background about their structure and history. The characters for these videos were selected based on my character frequency research, starting from the most frequent one. You can find more information about my character frequency study here.

In the future, I would like to do more videos like this on Mandarin Chinese pronunciation and on other languages as well. I would also like to record interviews with fellow language learners and post them on my channel.

Hope you enjoy the channel, and if you find the videos useful, feel free to subscribe.

Vladimir

Monday, November 5, 2012

Chinese character frequency list - Interview articles

Abstract

In this study I tried to analyze the Chinese character composition of about 60 interview articles in two Taiwanese online magazines, evaluate the data, produce a character frequency chart and a character knowledge vs text recognition chart, do absolute character prediction calculations and compare the data with previous analyses that I have done. I sampled a total of 45 235 characters and found that there was a total of 1865 unique characters in this sample. Based on my calculations I also found that in order to recognize 100% (using the word 'to recognize' and not 'to understand' on purpose throughout the article) of any given number of interview articles, one needs to know 2084 unique Chinese characters. When comparing this data to my previous news character analyses, I found that the interview character frequency list contains many more direct speech elements than the news article character frequency list does, and my calculations show that interview articles are easier to read for beginner and intermediate students of Mandarin Chinese than news articles are.

Introduction

In my past posts I tried to analyze the frequency of words and characters based on data that I sampled over a period of 6 weeks from 4 sections of Taiwanese news (please see the Character frequency analysis, Word frequency analysis and Character prediction analysis articles for more information).

In that study I found that there was a total of 2105 unique characters and 5901 unique words in the 80 articles I analyzed, which were separated into four sections: 國際 (international), 政治 (domestic politics), 社會 (society) and 財經 (economics). After extending my research and trying to predict what the number of unique characters and words in any given number of articles would be, I found that if these news articles were from the same 4 news sections I analyzed, there would be a total of 2174 unique characters and 8424 unique words in any given number of news articles, and a person would thus need to know this many characters and words to recognize 100% of any given number of news articles from these four news sections.

In this post I would like to present the results of my analysis of interview articles found in two Taiwanese magazines - Cheers and 天下雜誌. I chose to analyze interviews, because they come very close to spoken Mandarin and the words and expressions used in them differ greatly from the language used in news articles that I analyzed previously.

Research method

The data analysis method and character end ratio prediction method used in this paper are the same as in my previous posts. For more information, please see Character frequency analysis, Word frequency analysis and Character prediction analysis.

My data sampling method was slightly different in this case, since I only had to analyze character frequency and not word frequency. I entered as many interviews as I could find on the 天下雜誌 and Cheers magazine websites into a text file, cleaned it up by removing any numerals, commas, roman letters etc. and produced a raw file containing 45 235 characters. I sampled a larger amount of data for this analysis (greater by 18,77%) than in the News character frequency analysis (38 085 characters) and used this ratio where necessary while comparing the results of these two analyses to make up for the difference.
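For readers who want to repeat this kind of sampling, here is a minimal Python sketch of the clean-up and counting step. It is not the exact procedure I used; it assumes that a "character" is any code point in the basic CJK Unified Ideographs block (U+4E00 to U+9FFF), which drops punctuation, Arabic numerals and roman letters but keeps Chinese numerals such as 三. The function names and the sample sentence are made up for the example.

```python
import re

# Assumption: only the basic CJK Unified Ideographs block counts as a character.
HAN = re.compile(r"[\u4e00-\u9fff]")


def han_characters(text):
    """Return the Chinese characters in text, dropping everything else."""
    return HAN.findall(text)


def sample_stats(text):
    """Return (total characters, unique characters) for a raw text."""
    chars = han_characters(text)
    return len(chars), len(set(chars))


def percent_larger(larger, smaller):
    """How much larger one sample is than another, in percent."""
    return round((larger / smaller - 1) * 100, 2)


sample = "我覺得，2012年的台灣 interview 很有意思。我很喜歡！"
print(sample_stats(sample))            # (15, 13)
print(percent_larger(45_235, 38_085))  # 18.77
```

The last line reproduces the 18,77% ratio between the interview sample (45 235 characters) and the news sample (38 085 characters).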

Analysis results

* numbers adjusted by 18,77% for comparison purposes with news article data

The first 100 most frequent interview characters

One thing you can notice right away is that the characters in this chart seem very basic and familiar, even to beginner students. This is by all means true, because we are dealing with data that comes from interview transcriptions, which even if edited for print still represent direct speech and spoken Mandarin much better than Mandarin found in news articles or books does (see 白話 for more information).

Character knowledge Vs. Interview text recognition

In the above chart you can see the ratio between the amount of characters you know and the percentage of interview text you recognize. As with my previous analyses the most frequent characters account for much more of the text than the less frequent ones do, which means that by knowing a relatively small amount of characters you are able to recognize a relatively large amount of text, but learning to recognize the remaining 5% of the text will take you the same amount of time as learning to recognize the first 95% did.
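The relationship shown in such a chart can be computed with a cumulative frequency count. The following Python sketch is only an illustration under the assumption that the input is an already cleaned string of characters; `coverage_curve` and `coverage_at` are hypothetical helper names, and the toy sample stands in for the real data.

```python
from collections import Counter


def coverage_curve(text):
    """Return [(rank, share of text covered by the top `rank` characters)]."""
    counts = Counter(text)
    total = sum(counts.values())
    covered = 0
    curve = []
    # most_common() yields characters from the most to the least frequent.
    for rank, (_char, n) in enumerate(counts.most_common(), start=1):
        covered += n
        curve.append((rank, covered / total))
    return curve


def coverage_at(text, known):
    """Share of the text recognized by knowing the `known` most frequent characters."""
    curve = coverage_curve(text)
    if known <= 0 or not curve:
        return 0.0
    return curve[min(known, len(curve)) - 1][1]


toy_sample = "的的的的我我我你你他"
for rank, share in coverage_curve(toy_sample):
    print(rank, f"{share:.0%}")
```

Even in this ten-character toy sample the curve flattens: the first character covers 40% of the text, while the last one adds only 10%.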

Character prediction chart

Based on the sampled data and using the methods I used in the Character prediction analysis I calculated that one would need to know a total of 2084 characters to be able to recognize 100% of any given number of interview articles. The 'Total sample' row represents the total amount of characters in my interview article data sample. The 'Unique' row represents the total number of unique characters found in this data sample. The 'Unique estimated' row represents the estimated total number of characters necessary to know in order to recognize 100% of any given number of interview articles. The 'Estimated at' row represents the amount of characters in which the Unique estimated characters would be found. This means that based on my calculations, after having read Interview articles that would contain a total of 82 803 characters, you would not encounter any new unique characters. This estimation is only a mathematical calculation and serves for orientation purposes only. Please see Character prediction analysis for more info.
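The prediction itself is described in the Character prediction analysis post and is not reproduced here. What can be sketched is the curve such a prediction extrapolates from: the number of unique characters seen so far as more and more text is read. The Python function below is an illustrative helper of my own naming (`unique_growth`), working on an already cleaned string.

```python
def unique_growth(chars, step):
    """Return [(characters read, unique characters seen)] sampled every `step` characters."""
    seen = set()
    curve = []
    for position, char in enumerate(chars, start=1):
        seen.add(char)
        # Record a point at every step and once more at the very end.
        if position % step == 0 or position == len(chars):
            curve.append((position, len(seen)))
    return curve


print(unique_growth("我你他我你他我你她", step=3))
# [(3, 3), (6, 3), (9, 4)]
```

On real data the second value rises quickly at first and then more and more slowly; the 'Unique estimated' figure is an estimate of where that rise would stop.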

News and Interview data comparison

The most interesting part of this study was the comparison of results between the News article analysis and this analysis. Following are the results from the News character frequency analysis from my previous posts:

The first obvious thing you notice is that there were only 1867 unique characters found in the interview data compared to 2105 unique characters found in the news data, which is interesting, because the interview sample was larger by almost 19%. The reason for this will be explained later in this article.

The first 100 most frequent news characters

The first main difference when comparing this table with the interview frequency chart is the lack of typical direct speech elements: the personal pronouns 我 and 你/妳 are missing, as are verbs used for describing feelings and opinions, typical direct speech conjunctions etc.

As you might notice, the number of times that unique characters occurred in news articles is lower than the one of the interview articles (e.g. News: 的=737, Interviews: 的=1826). This is mostly due to the fact that I sampled more data in the interview articles analysis. Its influence on the character frequency order is relatively small, since we're dealing with two completely different sets of data. In other analytic operations, where this change would have caused major result differences, I adjusted the calculations by 18,77% (difference in the amount of sampled data between the two analyses) in order to make up for the difference.
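One way to compare raw counts from samples of different size is to convert them to relative frequencies, or to scale the smaller sample up by the size ratio. The Python sketch below uses only the figures quoted above (的 = 737 in 38 085 news characters, 的 = 1826 in 45 235 interview characters); whether this matches every adjustment I made in the charts is not implied, and the function name `per_10k` is made up for the example.

```python
NEWS_TOTAL = 38_085        # characters in the news sample
INTERVIEW_TOTAL = 45_235   # characters in the interview sample


def per_10k(count, total):
    """Occurrences per 10 000 characters of text."""
    return count / total * 10_000


news_de = per_10k(737, NEWS_TOTAL)
interview_de = per_10k(1826, INTERVIEW_TOTAL)
print(f"news: {news_de:.1f} per 10 000, interviews: {interview_de:.1f} per 10 000")

# Alternative: scale the news count up to the size of the interview sample.
scale = INTERVIEW_TOTAL / NEWS_TOTAL   # about 1.1877, i.e. 18,77% larger
print(f"news count scaled to interview sample size: {737 * scale:.0f}")
```

Either way, 的 is roughly twice as frequent in the interview sample as in the news sample, so the gap in raw counts comes from more than sample size alone.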

News and Interview character knowledge Vs text recognition

Based on the above chart and some other calculations, I found that in the beginning stages of your studies (while you know up to about 1000 characters) you need to know fewer characters to recognize a greater percentage of interview articles in Mandarin than of news articles. Later on, at around 1500 characters, this advantage becomes marginal. Since interviews represent direct speech pretty well, one might infer that you would also need to know fewer characters, or better yet fewer morphemes, to speak or write spoken Mandarin than to write in news-article-style Mandarin (please see 白話 for more information).

In my study, when it comes to Interview articles, the 100 most frequent characters accounted for 56% of the sampled text, while in the case of News articles, the 100 most frequent characters accounted only for 38% of the sampled text. This means, that by knowing the same amount of characters in the beginning stages one will recognize much more of the interview articles than news articles.

Based on my calculations, your greatest advantage comes at knowing the first 87 most frequent characters. By knowing the first 87 most frequent characters from the Interview articles frequency list you will be able to recognize 52,54% of the text found in interview articles, while knowing the same number of characters from the news frequency list will only let you recognize 35,37% of the news articles text, which is a difference of 17,17 percentage points.

While a 17,17% difference might not seem like much and is a little hard to imagine when it comes to a concept as abstract as the relative difficulty of perception of understanding two different types of texts, I tried to present it in another way and thought of contrasting two blue colors, in which one would be 17,17% less blue than the other to give you a feel of what this difference in perception looks like in another case.

I know this example is a little far fetched, but since we're talking about the relative difference in perception of understanding two things (in this case the different perception of difficulty of news articles and interview articles paralleled with the different perception of the same color in two different shades) I thought I could try to make this parallel and see how it works. The color to the left is 17,17% less blue than the color to the right:

Absolute character estimation comparison

Interview character estimation News character estimation

The above two tables compare the predicted absolute amounts of characters necessary to read any given number of articles (for more information on this research method please see Character prediction analysis ). As you can see, the predictions differ by only 90 unique characters.

In order to recognize 100% of any given number of news articles, based on my calculations you would need to know 2174 unique characters and in order to recognize 100% of any given number of interview articles, you would need to know 2084 unique characters.

An interesting thing that I found was that even though this absolute character prediction difference is relatively small, based on this study you would still need fewer characters to recognize a much greater amount of data in the case of interviews than in the case of news articles (only 2084 unique characters to recognize 82 803 characters of interview data, compared with 2174 unique characters to recognize only 53 776 characters of news data).

My explanation for this is that as can be seen in the chart above, the most frequent characters in Interviews (about the first 1000) are used much more often throughout the Interview articles and thus can cover more data.

Conclusion

You need to learn fewer characters in order to read Interview articles than you need to learn in order to read news articles in the beginner/intermediate stages of your studies.
Interviews are easier to understand for beginner/intermediate students.
The estimated absolute amount of characters for reading both news and interviews is roughly the same (news 2174, interviews 2084). The difference is that you can recognize more data by knowing fewer characters when reading interview articles at the beginning/intermediate stages, because the more frequent characters account for more of the text in the case of interview articles.
The composition of the characters found in the interview and news frequency charts differs in that typical direct speech elements are much less frequent in the news character frequency chart than they are in the interview frequency chart.
I would recommend interview reading over news reading for intermediate students.

You can download the full interview character frequency list in the download section of this blog.

Monday, March 19, 2012

Amount of characters and words necessary to read news articles

Abstract

Hello everyone and welcome to my never-ending study again. In the last two posts I was trying to count the number of unique Chinese characters and words in Taiwanese news by analyzing 80 news articles from Taiwan over a period of six weeks. In my study I found that there was a total of 2105 unique characters and 5901 unique words in the 80 articles I analyzed, which were separated into four sections: 國際 (international), 政治 (domestic politics), 社會 (society) and 財經 (economics). But as I said, 80 articles was not enough, so I tried to extend the study. Using the sampled data I did some calculations and tried to predict what the number of unique characters and words in any given number of articles would be. I found that there would be a total of 2174 unique characters and 8424 unique words, and a person would thus need to know this many characters and words to recognize 100% of any given number of news articles, provided these news articles were from the same 4 news sections I analyzed.

Introduction

The main task was to predict how the unique character and word charts would evolve and at what point on the y-axis they'd stop ascending. The x-axis value corresponding to that point would be the total number of characters a person needs to know in order to recognize 100% of a random news article, as long as it is from one of the 4 sampled news sections. As you can see by looking at the following two charts, both of them have ascending trends, with the Word knowledge chart ending in a sharp ascent that does not seem to approach any particular number.

I therefore looked at the source data again and, with the help of OpenOffice Calc, tried to calculate at what number the two charts would 'stall' (there is a mathematical expression for this; I only vaguely remember the Slovak term for it, which is probably 'asymptota', 'asymptote' in English) and what the total number of unique Mandarin characters and words in any given number of articles in Mandarin would be.

Research method

Since my research method was the same in both the Character and Word data prediction analyses, I will only describe the former. I noticed that in the last 50% of the character chart there were small fluctuations that could help determine what the trend in the development would be.

As you can see, these fluctuations are too small to make a reasonable prediction possible. I had to come up with a way to magnify them, and to do that I had to choose a completely different approach, which turned out to be pretty complicated.

The first two charts in this article only show that by knowing, for instance, the 800 most frequent characters one will be able to recognize about 90% of the 80 analyzed articles. In order to calculate the trend for an X number of articles, and since I could only work with the data I had at hand, I had to turn the whole study around and calculate how many unique characters there would have been had I only analyzed 40 articles, 45 articles or 50 articles, and try to calculate a reasonable prediction based on that.

Since I had the data already processed and all in one file, I no longer knew where one article finished and another started. And since these articles were of different sizes, I chose to work with percentages of the total data rather than article by article, since I thought it would be more precise.

What I basically did was take all the sampled data (80 articles) that I used for the word analysis (the text file where I manually put each word onto a new line), put it into the first column in OpenOffice Calc (a Microsoft Excel equivalent), calculate exactly 50% of that amount and run my friend's program to tell me how many unique characters were in that 50% of the data. In the next column I put 55% of the total amount of data, ran the program again and got the number of unique characters for 55% of the data. In the third column I put 60% of the total amount of data and ran the program again, and so on until I got to 100%. What I had now was the number of unique characters across the last 50% of the data from the original 80 articles, in 5% steps.

Percent of data	50%	55%	60%	65%	70%	75%	80%	85%	90%	95%	100%
Unique characters	1750	1813	1850	1892	1928	1966	2000	2013	2058	2077	2105
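The counting step behind this table can be sketched in a few lines. This is not my friend's program, only a guess at the same idea; the function name `unique_counts_by_prefix` is made up, and the sketch assumes the whole corpus is one string with whitespace already removed.

```python
def unique_counts_by_prefix(text, start_pct=50, stop_pct=100, step_pct=5):
    """Count unique characters in the first 50%, 55%, ... 100% of the text."""
    results = {}
    for pct in range(start_pct, stop_pct + 1, step_pct):
        cutoff = len(text) * pct // 100          # how many characters fall in this prefix
        results[pct] = len(set(text[:cutoff]))   # distinct characters seen so far
    return results

# Toy input standing in for the corpus: 20 characters, 10 distinct ones.
sample = "aabbccddeeffgghhiijj"
counts = unique_counts_by_prefix(sample)
for pct, n_unique in counts.items():
    print(f"{pct}%: {n_unique} unique characters")
```

Because every prefix contains the previous one, the counts can only stay the same or grow, which is why the table above never decreases.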

I now looked at the increases in these unique occurrences with the growing amount of data and tried to figure out how to calculate the trend past the 100% mark. As you can see, the number of unique characters was increasing by amounts that were, in general, getting smaller and smaller with each additional 5%. To my great delight, OpenOffice Calc has a function that can calculate trends for you, and although it took me a while to figure out exactly how to do it, I managed to produce the following chart:

I assigned the value of '1' to 55% (since the value of 0 represented 50%), '2' to 60%, '3' to 65% etc., with the value of '10' representing 100%. I plotted these values on the x-axis. On the y-axis I plotted the incremental number of unique characters corresponding to the increasing amount of data. This means, for instance, that at value 1, which is 55% of the data, there were 63 new unique characters compared to the previous 50% of the data.

Then I simply continued past the value of 10, which was equal to 100% of the original data, and let OpenOffice calculate the trend for me. Everything you see on the chart past the value of 10 on the x-axis is the trend calculated by OpenOffice Calc. As you can see on the chart, the graph hit zero at a value slightly over 17 (17,24 to be precise), which is equal to 141.2%. This means that at this point no more new unique characters would occur, no matter how much the amount of data continued to increase.
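The trend step can be sketched outside of OpenOffice Calc as well. Which Calc function was used is not stated here, so the ordinary least-squares straight line below is an assumption; fitted to the increments from the table above, it crosses zero at about 17.24 on the x-axis, the same value as quoted. The helper name `fit_line` is made up for this sketch.

```python
# Unique characters at 50%, 55%, ..., 100% of the data (from the table above).
unique_chars = [1750, 1813, 1850, 1892, 1928, 1966, 2000, 2013, 2058, 2077, 2105]

# New unique characters gained with each additional 5% of data (x = 1 is 55%).
increments = [b - a for a, b in zip(unique_chars, unique_chars[1:])]
xs = list(range(1, len(increments) + 1))

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

slope, intercept = fit_line(xs, increments)
zero_crossing = -intercept / slope  # x value where the fitted line reaches zero

print(increments)                  # [63, 37, 42, 36, 38, 34, 13, 45, 19, 28]
print(round(zero_crossing, 2))     # 17.24
```

A straight line is the simplest possible trend; a curve that flattens out more slowly would give a later zero point and a larger character total, so the result depends heavily on this choice.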

Analysis results

Percent of data	100%	105%	110%	115%	120%	125%	130%	140%	141%
Incremental unique characters	0	18,87	15,84	12,82	9,79	6,77	3,75	0,72	0

The number 18,87 in the third column, for instance, means that had I sampled 105% of the data instead of the original 100%, there would have been an additional 18,87 unique characters in it. All that was left to do was to take the total number of unique characters I found in 100% of the original articles (2105) and add the sum of the predicted unique character increases to it. The total predicted number I arrived at was 2174. This means that, according to my data and trend calculations, if you were to continue reading only these 4 news sections from Taiwan, you would not find more than 2174 unique characters in them, no matter how many articles you read. I can thus say that, based on my data and trend calculations, one needs to know 2174 characters in order to be able to recognize 100% of any given number of news articles, provided that they are found in the 4 news sections I analyzed.
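The addition described above is easy to check. The sketch below takes the predicted increments straight from the table (written with decimal points) and adds them to the 2105 characters actually counted; the variable names are my own.

```python
# Predicted additional unique characters per extra 5% of data (from the table above).
predicted_increments = [18.87, 15.84, 12.82, 9.79, 6.77, 3.75, 0.72]

observed_unique = 2105  # unique characters found in the 80 sampled articles

predicted_total = observed_unique + sum(predicted_increments)

print(round(sum(predicted_increments), 2))  # 68.56
print(round(predicted_total))               # 2174
```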

Just for reference, the total number of characters at which the predicted graph hit zero, and thus the number of characters after which no more new unique characters occurred, was 53 776 (141.2% of the original total number of characters in the 80 sampled articles), which roughly corresponds to 113 news articles. This would mean that after reading 113 news articles in the 4 sections I analyzed, one would not come across any new characters.

Word prediction chart

I did the same trend calculations with the Unique word occurrence chart as I did with the Character prediction chart, but as I mentioned before, because of the 5% error margin, the number I came up with is really just a very rough estimate. Plus, it is evident from the Word knowledge vs. Text recognition chart at the beginning of this article that the chart was still in sharp ascent at its end and could develop in a lot of unpredictable ways, so the following chart is really only a very rough estimate.

After doing the calculations and plotting the data on the chart, I found that there would be a total of 8424 unique words that one would need to know in order to recognize all words in any given number of news articles in the 4 sections of Taiwanese news I analyzed. The predicted trend chart hit zero at the value 36,54, which corresponds to 235,69% of the original sampled data, or 49 705 words. These would be found in 189 articles, which means that according to my trend predictions, you would not encounter any new words after having read 189 articles.

Conclusion

Below is a table with the end results of the entire study. Even though the number of articles I sampled was really small and the predictions I made were only estimates, in my opinion the results were quite precise, at least for the unique character calculations. To my great surprise, it really seems that one does not need to know more than 2174 unique characters in order to recognize 100% of any number of news articles, as long as they are found in one of the 4 news sections I analyzed.

	Characters	Words
Original articles	80	80
Total amount	38085	21089
Unique	2105	5901
Unique estimated	2174	8424
Estimated at	53776	49705
Estimated at (articles)	113	189
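The last two rows of the table follow from the totals and the percentages at which the two trend lines hit zero (141.2% for characters, 235,69% for words). Below is a minimal sketch of that conversion, assuming every article has the average length of the 80 sampled ones; the function name `extrapolate` is hypothetical.

```python
def extrapolate(total_units, n_articles, percent_at_zero):
    """Convert the percentage where the trend hits zero into a unit count and an article count."""
    units_at_zero = total_units * percent_at_zero / 100
    average_per_article = total_units / n_articles  # assumes average-length articles
    articles_at_zero = units_at_zero / average_per_article
    return round(units_at_zero), round(articles_at_zero)

chars_at_zero, char_articles = extrapolate(38085, 80, 141.2)   # Characters column
words_at_zero, word_articles = extrapolate(21089, 80, 235.69)  # Words column

print(chars_at_zero, char_articles)  # 53776 113
print(words_at_zero, word_articles)  # 49705 189
```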

My guess is that the number of unique characters needed to read the news is so small because news and books draw on partly different sets of characters: a great number of the news characters would be absent or rarely used in books, and books in turn contain characters that are absent or rarely used in news articles. Another interesting conclusion is that you would need to know almost 4 times as many unique words as unique characters, which again suggests that the number of unique words one knows is more important than the number of unique characters.

Finally, since this study was only a study of text recognition, in the future I would also like to do a study on text understanding and come up with the number of unique characters and words one would need to know in order to understand news articles, books or MSN chats. If everything goes well, I would like to do similar analyses of book texts and real-life speech.
