A Corpus-based Analysis on Distributional Patterns of Collocations in Chinese High School English Textbooks

Abstract: The paper explores the distributional patterns of collocations identified in Chinese high school textbooks in comparison with collocations found in a native reference corpus. Six series of Chinese high school English textbooks from six publishers were compiled as the textbook corpus. The British National Corpus (BNC) was selected as the native norm. Three collocation types, adjective + noun collocation (ANC), verb + noun collocation (VNC) and noun + noun collocation (NNC), were investigated. The distribution of the collocations was compared between the two corpora in terms of four statistical measures: density, diversity, repetition and association strength. The results showed that whereas the diversity of collocations was higher in the native reference corpus than in the textbook corpus, the other measures showed the opposite pattern. This may partly be due to the pedagogical nature of the textbooks. Moreover, the findings revealed that the textbooks represent a comparable number of VNCs, over-represent ANCs and under-represent NNCs in comparison with the native reference corpus. The findings suggest that textbook authors should consider incorporating more diverse collocations and more NNCs to model more native-like texts in teaching materials targeting higher-grade students.


Introduction
Corpus linguistics provides a powerful tool for exploring a variety of language-related issues, ranging from the discovery of patterns of actual language use by theoretical linguists to the development of teaching materials by language teaching professionals (Reppen & Simpson-Vlach, 2010). One of the characteristics of corpus-based analyses of language is that "it utilizes a large and principled collection of natural texts, known as a 'corpus', as the basis for analysis" (Reppen & Simpson-Vlach, 2010, p.91).
Collocations, as one type of formulaic language, refer to sequences or sets of words that co-occur with greater probability than random chance (Reppen & Simpson-Vlach, 2010). This statistical conceptualization of collocations naturally leads to the use of corpora for the identification and analysis of collocations. In light of the important role that formulaic language plays in language acquisition and use (Granger, 1998; Lewis, 1993; Nattinger & Decarrico, 1992), collocations have attracted increasing attention in corpus-based studies of language (Gablasova, Brezina, & McEnery, 2017). The resulting body of corpus-based research into collocations provides valuable insights into not only the characteristics of collocations per se but also the vital role that collocations play in language learning and use.
Given the importance of collocational knowledge in L2 development, Tsai (2015), for example, examined verb + noun collocations in the EFL textbooks used in Taiwan and in Chinese learners' writing in comparison with native speakers' essays (Louvain Corpus of Native English Essays). It was found that the textbooks were comparable to the native speakers' essays regarding collocational density and diversity, but the former did not repeat collocations as frequently as the latter. However, few studies have specifically targeted the revised Chinese high school textbooks and the wordlists of the 2017 revised Chinese national curriculum of English. The present study aims to investigate whether collocations used in Chinese high school English textbooks reflect native-like patterns of collocations. It specifically attempts to explore the distributional patterns of collocations in Chinese high school English textbooks.

Literature review
The frequency-based approach is one of the theoretical approaches to collocations. In the frequency-based approach, collocations are identified by quantitative criteria such as simple frequency of occurrence and statistics that measure the strength of the tendency for individual words to co-occur in a collocation. Collocations are sequences of words that "have a statistical tendency to co-occur" in a corpus (Durrant, 2014, p. 446). Firth, Halliday, and Sinclair are the representatives of this tradition. Firth (1957, p.11) brought the term "collocation" into prominence and emphasized that collocations play a crucial role in establishing the meaning of a word, which is summarized by his well-known statement: "You shall know a word by the company it keeps". According to Firth (1957, p.12), habitual collocation is a type of "mutual expectancy" between words. To put it differently, when one word is found, the other is likely to be found. This idea of mutual expectancy underlies the conceptualization of collocation as "the relationship a lexical item has with items that appear with greater than random probability in its textual context" (Hoey, 1991, p.7).
Density, diversity, repetition, and association strength have been extensively used to describe the distributional characteristics of collocations in a corpus. For example, Tsai (2015) used density, diversity, and repetition rate to investigate the verb + noun collocations in EFL textbooks in Taiwan and Chinese EFL learners' writings in comparison with the native reference corpus. It was found that the textbooks were comparable to the native writings in terms of density. However, the textbooks under-represented the verb + noun collocation types in comparison to the native corpus. Moreover, the repetition of the majority of collocation types in the textbooks was deemed insufficient for L2 learners to acquire collocations (Tsai, 2015).
Likewise, Kim (2020) examined VNCs, ANCs, and NNCs used in Korean EFL textbooks compared with the native reference corpus in terms of density, diversity, repetition, and association strengths. Findings showed that the textbook corpus, by and large, exhibited higher collocation density and diversity, less repetition, and stronger association strength. Kim suggested that collocation density and diversity are at the cost of less repetition, and teachers should prepare supplemental materials to make up for less repetition of collocations in the textbooks. Moreover, more highly probable collocations were found in the textbooks in comparison to the native corpus. Kim (2020) thus suggested "incorporating less-than-typical collocations into the materials".
The two studies successfully showed the detailed distributional characteristics of collocations in L2 corpora using the four measures. The findings provide useful implications for curriculum developers and textbook writers.

Corpora
The British National Corpus (BNC), which is one of the largest English corpora, was chosen as the native reference corpus in the present study. It contains 100 million words and has been considered one of the most representative corpora of general English currently available. The corpus consists of 90% written language and 10% spoken language from a wide range of sources and covers a wide cross-section of British English from the late 20th century. The written texts in the BNC are extracted from "regional and national newspapers, specialist periodicals and journals for all ages and interests, academic books and popular fiction, published and unpublished letters and memoranda, school and university essays" (http://www.natcorp.ox.ac.uk/corpus/), with a total of 86,097,791 words.
The BNC has been referred to by many researchers to determine the frequency and MI of NNS (non-native speakers) and NS (native speakers) collocations (Durrant & Schmitt, 2009; Siyanova & Schmitt, 2008). It is assumed that combinations which are frequently used in the BNC are representative of common usage in English (Durrant & Schmitt, 2009).
A textbook corpus consisting of six series of textbooks was compiled for collocation analysis. These EFL textbooks followed the Chinese 2017 revised national curriculum and were published by six publishers. They have been widely used in high schools in China. Most series of textbooks contain seven or eight volumes designed for the six terms of the high school English curriculum, while some series comprise six volumes. For the sake of convenience, the textbook series are hereafter referred to by abbreviations based on their publishers (e.g., BU). The electronic textbooks were either downloaded from the official websites of the publishers or purchased online. The Optical Character Recognition (OCR) feature in Google Drive was used to convert the PDF versions of these textbooks into plain text. Since errors were inevitable in the process of automatic conversion, the output was checked and corrected manually.
In Kim's (2020) study, reading passages from middle and high school English textbooks were included. In a similar vein, all the reading passages in the main textbooks and workbooks were included in the present study. In Tsai's (2015) study, instructions and vocabulary exercises were included in the textbook corpus because it was believed that all collocations in the textbooks, regardless of the sections where they occur, should be examined, given that textbooks serve as the major source of language input for EFL learners. However, in the present study, instructions and exercises were excluded because instructions consist of only a limited set of lexical items that are massively repeated in the textbooks, and exercises are usually derived from the preceding reading materials.
The textbooks under investigation were published based on the 2017 revised national curriculum. As Table 1 shows, the textbook corpus was compiled into six separate sub-corpora by publisher. In the table, "single word" refers to the words appearing in the textbooks, while "node word" refers to the nouns in the wordlist specified by the national curriculum. Each of the six textbook sub-corpora included on average 49,410 tokens of 6,038 different types. Among these 6,038 word types, an average of 1,210 (20%) was found to match the nouns from the curriculum, occurring 8,686 times in each textbook corpus. To put it another way, each series of textbooks may present students with, on average, 1,210 noun types from the curriculum, with 8,686 tokens in total throughout the high school curriculum. It is noted that the textbook corpus is a collection of six sub-corpora, each of which is hypothesized to constitute the language input to learners. The distribution statistics of collocations in the textbook corpus (i.e., density, diversity, repetition rate and association strength) are averages of those from each sub-corpus consisting of the textbooks from a particular publisher.

Statistical criteria
The logDice measure reveals the extent of collocational bonding. It neither favors low-frequency word combinations nor downgrades high-frequency ones. Therefore, the logDice score and the frequency of co-occurrence were employed in the present study to identify collocations from the native reference corpus.
Frankenberg-Garcia et al. (2018) found that a threshold of logDice score greater than or equal to 5 works well for identifying lexical collocations. A minimum logDice score of 5 was therefore used in this study (Frankenberg-Garcia et al., 2018; Kim, 2020). The minimum frequency of occurrence for a word combination to be considered a collocation was also set at 5. Word combinations meeting both of these cut-off points were identified as collocations, while those below either threshold were excluded from the analysis.
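As a rough illustration of these cut-offs, the sketch below computes logDice with its standard definition, 14 + log2(2·f_xy / (f_x + f_y)), and applies the two thresholds. The function names and the frequencies passed in are hypothetical and are not taken from the study's data or script.

```python
import math

MIN_LOGDICE = 5.0  # association threshold stated in the text
MIN_FREQ = 5       # minimum co-occurrence frequency stated in the text


def log_dice(f_xy: int, f_x: int, f_y: int) -> float:
    """Standard logDice: 14 + log2(2 * f_xy / (f_x + f_y))."""
    if f_xy <= 0:
        raise ValueError("co-occurrence frequency must be positive")
    return 14 + math.log2(2 * f_xy / (f_x + f_y))


def is_collocation(f_xy: int, f_x: int, f_y: int) -> bool:
    """A combination qualifies only if it meets both cut-offs."""
    return f_xy >= MIN_FREQ and log_dice(f_xy, f_x, f_y) >= MIN_LOGDICE
```

Because the score depends only on the ratio of the joint frequency to the two word frequencies, it is bounded above by 14 and does not grow with corpus size.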

Distributional measures
To compare the overall distributional patterns of collocations between the corpora under analysis in this study, four aspects of distribution were analyzed: collocation density, collocation diversity, repetition rate, and association strength. Each of these distributional measures is briefly reviewed below, together with a specific statistic used to quantify it in this study.
Firstly, collocation density reflects in a relative way how many tokens constitute collocations in a corpus. Since the corpora were of different sizes, the collocation density was operationalized as the relative frequencies of collocation tokens per 1,000 words to indicate how many collocations occur within a text of the same length (Laufer & Waldman, 2011).
Secondly, collocation diversity is used to measure the number of collocation types in the target corpus. To minimize the corpus size effect, the formula "collocation diversity = the number of collocation types/the square root of word tokens" was used (Kim, 2020).
Thirdly, the repetition rate is calculated to measure how many times the individual collocation types are repeated throughout the corpus. The formula "Root type-token ratio (RTTR) = total collocation type counts/the square root of the total collocation token counts" was employed to reduce the size effect (Paquot, 2018). It indicates the degree to which the same collocation types are repeated in an inverse way. The higher the RTTR score is, the less repetitive an individual collocation type is.
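The three count-based measures described so far can be sketched as follows. This is a minimal illustration of the quoted formulas; the function names and the counts used in the example are hypothetical.

```python
import math


def collocation_density(colloc_tokens: int, word_tokens: int) -> float:
    """Collocation tokens per 1,000 running words."""
    return colloc_tokens / word_tokens * 1000


def collocation_diversity(colloc_types: int, word_tokens: int) -> float:
    """Collocation types divided by the square root of word tokens."""
    return colloc_types / math.sqrt(word_tokens)


def rttr(colloc_types: int, colloc_tokens: int) -> float:
    """Root type-token ratio; a higher value means less repetition."""
    return colloc_types / math.sqrt(colloc_tokens)


# Hypothetical counts for illustration only (not from the study's tables).
word_tokens, colloc_tokens, colloc_types = 50_000, 3_000, 2_000
print(round(collocation_density(colloc_tokens, word_tokens), 2))
print(round(collocation_diversity(colloc_types, word_tokens), 2))
print(round(rttr(colloc_types, colloc_tokens), 2))
```

Dividing by a square root rather than by the raw count dampens, but does not remove, the effect of corpus size on type-based measures.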
Lastly, the logDice score indicates the probability that the words occur together. The median of logDice scores based on the collocation types was used as the measure for association strength in the present study. The median of logDice scores is a central tendency measure to indicate the overall association strength of all the collocations. A higher median logDice score is indicative of the overall tendency towards strongly associated collocations. A lower median score, on the other hand, indicates the overall tendency towards less strongly associated collocations. To put it differently, words in such collocations also frequently occur with many other words.
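A small sketch of the association-strength measure, assuming one logDice score per collocation type. The scores below are invented for illustration; they show why a median is less affected by a few very high scores than a mean would be.

```python
from statistics import mean, median
from typing import Iterable


def association_strength(logdice_scores: Iterable[float]) -> float:
    """Median logDice over collocation types (one score per type)."""
    return median(logdice_scores)


# Hypothetical scores: four ordinary collocations and one extreme value.
scores = [5.2, 5.9, 6.4, 7.1, 11.6]
print(association_strength(scores))  # middle value of the sorted scores
print(round(mean(scores), 2))        # pulled upward by the extreme score
```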

Software
The extraction of word combinations of interest from the reference corpus and the textbook corpus was conducted with SketchEngine, which is a web-based corpus analysis tool (https://www.sketchengine.eu/). It includes a set of software tools to analyse patterns of language use in a corpus. A range of functions is offered on the website, among which the function of 'word sketch' returns typical collocates for a given word. This function produces a simple frequency of occurrence of collocations and the association strength between the node word and the collocate in terms of the logDice score.
A Python script was written and combined with Application Programming Interface (API) requests sent to the SketchEngine web server to search the reference corpus for word combinations including the node nouns. The Python script automatically extracted all the co-occurring words for each of the 1,784 node nouns. The data retrieved by the Python script contain the co-occurring words for each of the node nouns, together with the frequency of occurrence and logDice score for each identified word combination.
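The overall shape of such an extraction loop might look like the sketch below. It is not the study's script: the web request is deliberately abstracted into a caller-supplied function, here called `fetch_word_sketch` (a hypothetical name), so no SketchEngine API details are assumed. The row layout is likewise an assumption.

```python
from typing import Callable, Dict, Iterable, List, Tuple

# Assumed row layout: (collocate, relation, frequency, logDice).
SketchRow = Tuple[str, str, int, float]


def collect_candidates(
    node_nouns: Iterable[str],
    fetch_word_sketch: Callable[[str], List[SketchRow]],
) -> List[Dict[str, object]]:
    """Query each node noun once and flatten the results into records."""
    records: List[Dict[str, object]] = []
    for noun in node_nouns:
        for collocate, relation, freq, score in fetch_word_sketch(noun):
            records.append({
                "node": noun,
                "collocate": collocate,
                "relation": relation,
                "freq": freq,
                "logDice": score,
            })
    return records
```

Passing the fetcher in as a parameter keeps the loop testable offline, for example with a stub that returns fixed rows.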

Collocation identification in target corpora
To identify collocations, the present study adopted the approach used by Tsai (2015) and Kim (2020). A reference collocation list was first generated from the native reference corpus and used to identify collocations in the textbook corpus. Collocations in the present study were identified through three steps.
Firstly, a reference collocation list was generated from the BNC. Collocations in the reference collocation list meet the minimum criteria with logDice score set at 5 and the co-occurrence frequency set at 5.
Secondly, as was done with the BNC, the Python script was used to retrieve the co-occurring words for 1,784 node nouns from the six textbook corpora respectively. Only verb + noun combinations, adjective + noun combinations, and noun + noun combinations were retrieved.
Thirdly, all the items extracted from the target corpora were checked against the reference collocation list. Those items which were found in the reference collocation list were identified as collocations.
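The three steps above can be sketched as a filter followed by a set lookup. The tuple layouts, relation labels ("VN", "AN", "NN"), function names, and sample values are assumptions made for illustration.

```python
from typing import Iterable, List, Set, Tuple

Combo = Tuple[str, str, str]                   # (relation, collocate, node noun)
ScoredCombo = Tuple[str, str, str, int, float]  # ... plus frequency and logDice

TARGET_RELATIONS = {"VN", "AN", "NN"}


def build_reference_list(bnc_rows: Iterable[ScoredCombo],
                         min_logdice: float = 5.0,
                         min_freq: int = 5) -> Set[Combo]:
    """Step 1: keep reference-corpus combinations that meet both cut-offs."""
    return {
        (rel, collocate, node)
        for rel, collocate, node, freq, score in bnc_rows
        if freq >= min_freq and score >= min_logdice
    }


def identify_collocations(textbook_combos: Iterable[Combo],
                          reference: Set[Combo]) -> List[Combo]:
    """Steps 2-3: keep target-type combinations found in the reference list."""
    return [
        combo for combo in textbook_combos
        if combo[0] in TARGET_RELATIONS and combo in reference
    ]
```

Under this design the thresholds are applied only to the reference corpus, so a combination that is rare in a textbook still counts as a collocation if it qualifies in the reference list.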

Overall results
The token counts of candidates and collocations are presented in Table 2. It should be noted that candidates refer to the VN combinations, AN combinations, and NN combinations extracted from the native reference corpus, while collocations are those chosen from the candidates based on the frequency of occurrence and association measure criteria. In the table, "Token counts of collocations" refers to the total number of collocations occurring in the corpus, with all the instances of a collocation counted as the number of tokens for that collocation. Table 3 presents the type counts of the candidates and collocations in each corpus. "Type counts of collocations" refers to the number of unique collocations occurring in the target corpus.

Collocation Density
The top half of Table 4 shows the total number of collocation tokens in each corpus. There are 2,923 collocation tokens (1,277 VNCs, 1,271 ANCs, and 375 NNCs) on average in each textbook series.
The lower half of the table shows the total number of collocation tokens per 1,000 words, which controls for differences in corpus size and represents the density of collocations in each corpus in the present study. The textbook corpus provides 59.16 collocation tokens per 1,000 words (the average across the sub-corpora), while the native reference corpus contains 53.27 collocation tokens per 1,000 words. Among these tokens, VNCs (25.84) and ANCs (25.72) are presented more frequently in the textbook corpus than in the native reference corpus (19.24 VNCs, 21.84 ANCs). On the other hand, there are fewer NNCs (7.59) in the textbooks than in the reference corpus (12.16).
A statistically significant difference was found in the density of ANCs and NNCs between the textbook corpus and the reference corpus, but not in that of VNCs (ANCs: χ² = 7.9376, p < .05; NNCs: χ² = 8.3001, p < .05; VNCs: χ² = 2.5306, p = 0.1116). This implies that the textbooks over-represent ANCs and under-represent NNCs in comparison to the native reference corpus.
Notes to Table 4: 1) Average collocation tokens per textbook corpus by publisher; 2) Collocation tokens / word counts × 1,000.
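The reported significance levels can be sanity-checked from the χ² statistics alone. The sketch below assumes each test has one degree of freedom, which the text does not state; under that assumption the upper-tail probability has a closed form via the complementary error function. The function name is hypothetical.

```python
import math


def chi2_p_value_df1(chi2: float) -> float:
    """Upper-tail p-value of a chi-square statistic with 1 degree of freedom.

    For df = 1, P(X > x) = erfc(sqrt(x / 2)).
    """
    if chi2 < 0:
        raise ValueError("chi-square statistic cannot be negative")
    return math.erfc(math.sqrt(chi2 / 2))


# Statistics as reported in the text; df = 1 is an assumption.
for label, stat in [("ANCs", 7.9376), ("NNCs", 8.3001), ("VNCs", 2.5306)]:
    print(label, round(chi2_p_value_df1(stat), 4))
```

Under the one-degree-of-freedom assumption, the VNC statistic yields a p-value close to the reported 0.1116, and the ANC and NNC statistics fall below .05.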

Collocation Diversity
The top half of Table 5 shows the total number of collocation types in each corpus, which is taken to indicate the diversity of collocations used in a corpus. The textbook corpus consists of 2,059 collocation types on average, with 1,210 curriculum-based noun types as node words. The lower half of the table presents collocation diversity rates, which take account of corpus size. The diversity rates of VNC types (3.97), ANC types (4.02), and NNC types (1.27) in the textbook corpus are lower than those in the reference corpus (4.24, 4.61, and 3.73, respectively). It is hardly surprising that the overall diversity rate of collocation types in the textbooks (9.26) is lower than that in the reference corpus (12.58). Given the pedagogical purpose of the textbooks and the limited class time, textbooks cannot include all the types of collocations which are found in the native reference corpus.

Repetition Rate
The repetition rate in the present study was measured with the Root Type-Token Ratio (RTTR). It is calculated by dividing the number of collocation types by the square root of the total number of collocation tokens. It is an inverse indicator of the degree to which the same collocation types are repeated. In other words, a higher RTTR score indicates less repetition of the collocation types found in a corpus, while a lower RTTR score means more recurrences of the same collocation types (Kim, 2020). Table 6 presents the degrees to which the same collocations are recycled in the textbook corpus and the reference corpus. Overall, the collocations of all subtypes in the textbooks are more repetitive than their counterparts in the reference corpus. This result conflicts with previous studies, which have shown that textbooks present insufficient repetition of target words or collocations (Kim, 2020; Koya, 2004; Tsai, 2015; Yu & Renandya, 2021). Notably, NNCs are recycled to a greater extent than VNCs and ANCs in the textbook corpus (RTTR of 24.71 for VNCs, 25.08 for ANCs, and 14.56 for NNCs). By contrast, NNCs in the reference corpus were repeated less often than VNCs and ANCs (30.56 for VNCs, 31.17 for ANCs, and 33.82 for NNCs). This difference in the relative repetition rates for NNCs between the two corpora may arise because noun + noun phrases are typical of formal and academic writing. They are usually technical terms or the embodiment of specific meanings (e.g., balance sheet). The topics in the textbooks hinge on pedagogical purpose while the themes in the reference corpus vary across different fields. In other words, the topics selected in the textbooks are those that the textbook authors consider useful and helpful to the students' L2 learning and are believed to satisfy the learners' needs.
The written texts of the BNC, on the other hand, include academic books, specialist periodicals, and journals, etc., and thus cover a wider range of topics than the textbooks. It may be one of the reasons why NNCs repeat themselves less often in the reference corpus than in the textbooks.

Association Strength
logDice is used to determine how strong the association is between the words in a collocation. A high score means that the collocate is found together with the node word more often than with other words. Table 7 displays the median logDice score of collocations in the textbook and reference corpora. The median is the middlemost value of the distribution, that is, the value that splits the data into an upper half and a lower half (Mackey & Gass, 2016). It is commonly used when the data contain extreme scores (Mackey & Gass, 2016). In the present study, the median is more appropriate because the data contain collocations with very high logDice scores (greater than 11). The median logDice scores for all the sub-types of collocations in the textbook corpus (6.85 for VNCs, 6.99 for NNCs, and 7.44 for ANCs) are higher than those in the reference corpus (6.1 for VNCs, 6.18 for NNCs, and 6.37 for ANCs). The association strength of collocations was further divided into five bands of logDice scores: lower-mid (5~6.5, not including 6.5), mid (6.5~8, not including 8), upper-mid (8~9.5, not including 9.5), high (9.5~11, not including 11), and very high (over 11). Figure 1 illustrates the distribution of all collocation types across logDice score bands in the textbooks and the native reference corpus. As shown in the figure, collocations with the lower-mid level of association strength take up the largest proportion in both the textbook corpus and the native reference corpus (37.9% and 55.4% respectively). The reference corpus covers relatively more unique collocations with lower-level association strength ranging from 5 to 6.5 (55.4%) than the textbook corpus (37.9%), while the textbook corpus presents relatively more unique collocations toward the higher end of the association strength in comparison to the reference corpus. The result is in line with the median logDice scores of collocations in the two corpora.
The median logDice score in the textbook corpus is higher than 6.5 while that in the reference corpus is lower than 6.5. The two corpora diverge in that a majority of collocation types in the textbook corpus (53.4% in total) fall at the mid (6.5-8.0) and upper-mid (8.0-9.5) levels of association strength, a larger share than in the BNC (42.3% in total). Meanwhile, collocation types at the high and very high levels (logDice scores over 9.5) account for 8.8% in the textbook corpus, considerably more than in the BNC (2.4% in total).
The overall findings agree with Kim's (2020) results that a large majority of collocations in the Korean high school textbooks are distributed from the mid to high level (logDice scores over 6.5).
In addition, more types of collocations with association strength from high to very high levels are found in the textbooks than in the reference corpus. It indicates that the EFL textbooks present a majority of collocations which are more strongly associated while the reference corpus prefers collocations with relatively weak associations.
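The five logDice bands used in this section can be sketched as a simple lookup. The function names and sample scores are hypothetical; since the high band excludes 11, a score of exactly 11 is assigned to the very high band here, which is an assumption about the boundary.

```python
from collections import Counter
from typing import Dict, Iterable

# (lower bound inclusive, upper bound exclusive, label)
BANDS = [
    (5.0, 6.5, "lower-mid"),
    (6.5, 8.0, "mid"),
    (8.0, 9.5, "upper-mid"),
    (9.5, 11.0, "high"),
]


def band(score: float) -> str:
    """Map a logDice score (at least 5) to its association-strength band."""
    if score < 5.0:
        raise ValueError("scores below 5 are not collocations in this study")
    for low, high, label in BANDS:
        if low <= score < high:
            return label
    return "very high"


def band_shares(scores: Iterable[float]) -> Dict[str, float]:
    """Percentage of collocation types falling in each band."""
    counts = Counter(band(s) for s in scores)
    total = sum(counts.values())
    return {label: n / total * 100 for label, n in counts.items()}
```

Applied to one logDice score per collocation type in each corpus, `band_shares` would give the kind of percentages compared in the paragraphs above.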

Discussion and conclusion
The result shows that collocations in the textbook corpus, on the whole, are much denser, more repetitive, and more strongly associated but less diversified in comparison to those in the native reference corpus.
In terms of the normalized frequency, with differences in corpus size controlled for, more collocations occur overall in the textbook corpus than in the native reference corpus. The main reason may be that the textbooks contain far fewer words than the native reference corpus. Since the textbooks are purposely designed and used as the main source of L2 input, textbook writers may present a much denser distribution of collocations within limited texts.
In spite of the denser use of collocations, by and large, the textbooks under-represented NNCs relative to the native reference corpus. By contrast, the textbooks covered comparable tokens of VNCs and a lot more tokens of ANCs, which might lead to unbalanced acquisition between NNCs on the one hand and VNCs and ANCs on the other. One plausible reason for the under-representation of NNCs is that NNCs are characteristic of formal or academic writing. Although the texts in the high school EFL textbooks are a kind of academic writing, the topics in the textbooks are selected to cater for high school students, while the texts in the BNC range from newspapers and specialist journals to academic books. In addition, more proficient learners have been shown to use noun + noun phrases more frequently than less proficient learners (Parkinson & Musgrave, 2014). NNCs might thus have been considered by textbook writers and publishers to be less useful or less important for secondary school students with relatively low proficiency levels.
Collocation diversity was calculated as the number of collocation types divided by the square root of total word tokens to minimize the corpus size effect. The collocations in the textbooks are not as diversified as those in the native reference corpus, which is contrary to previous studies that have shown higher collocation diversity in textbooks (Kim, 2020; Koya, 2004; Tsai, 2015). Kim (2020) reported that English textbooks for Korean middle and high school students contained more collocation types of VNCs, ANCs and NNCs than the native reference corpus. More collocation types were found in the English textbooks used in Japan than in the native history textbooks in Koya's (2004) study. In a similar vein, Tsai (2015) found that VNCs in the textbooks were more diversified than in the native reference corpus. However, this does not necessarily imply that the textbooks should present more varied collocations. Romer (2004, p.161) claims that "learning a variety of English they will rarely encounter in real-life situations is very unlikely to help learners communicate successfully with competent speakers of English". In other words, useful and important vocabulary items and collocations should be presented in the textbooks. Therefore, it is unrealistic and unnecessary to include all collocation types used by native speakers in textbooks.
The highly repetitive co-occurrence of the same collocations may account for the higher collocation density and lower diversity in the textbooks. In other words, the textbooks seem to achieve enhanced repetition at the cost of diversity in terms of collocations. This might be beneficial for L2 learners to the extent that repetition plays a conducive role in language learning. According to Ellis (2001), repetition of sequences allows their consolidation and retention in long-term memory. It remains unclear, however, how many exposures are required for collocation learning. Some research claims that vocabulary learning needs more than six or seven instances of repetition (Peters, 2014; Webb, 2007). Durrant and Schmitt (2010) propose that collocations need to be recycled at least 8-10 times to be acquired. More repetition of collocation types in the textbooks would be conducive to the learners' mastery of collocations.
The spread of collocational strength shows that strongly associated collocations with higher logDice scores are dominant in the textbooks while collocations with comparatively lower logDice scores dominate in the reference corpus. The reason for this difference may be that textbook writers have ascribed greater importance to these strongly associated collocations probably due to greater salience in the input.
The divergences in the four distributional properties between the textbooks and the native reference corpus do not imply that the textbooks are seriously deficient. The textbooks are designed and written to equip students with a command of the L2. The pedagogical purpose of textbooks should be borne in mind. Textbooks do not need to resemble native texts in all aspects, e.g., the distribution of VNCs, ANCs and NNCs found in the native reference corpus. However, there is still some room for improvement in textbooks. It may be more helpful to the learners if textbook writers could incorporate more diverse collocations to model more native-like texts in the textbooks. Gitsaki (1996) contended that inadequate exposure to specific types of collocations in the textbooks has contributed to the learners' avoidance of these types. Therefore, more coverage of NNCs in the textbooks targeting higher-grade learners may benefit their writing.