Studies in Language Testing
Volume 18

European language testing in a global context
Proceedings of the ALTE Barcelona Conference, July 2001

Edited by Michael Milanovic and Cyril J Weir

Series Editors: Michael Milanovic and Cyril J Weir

This book is divided into four sections. Section One contains two papers exploring general matters of current concern to language testers worldwide, including technical, political and ethical issues. Section Two presents a set of six research studies covering a wide range of contemporary topics in the field: the value of qualitative research methods in language test development and validation; the contribution of assessment portfolios; the validation of questionnaires to explore the interaction of test-taker characteristics and L2 performance; rating issues arising from computer-based writing assessment; the modelling of factors affecting oral test performance; and the development of self-assessment tools. Section Three takes a specific European focus and presents two papers summarising aspects of the ongoing work of the Council of Europe and the European Union in relation to language policy. Section Four develops the European focus further by reporting work in progress on test development projects in various European countries, including Germany, Italy, Spain and the Netherlands.

Its coverage of issues with both regional and global relevance means this volume will be of interest to academics and policymakers within Europe and beyond. It will also be a useful resource and reference work for postgraduate students of language testing.

Also available:

Multilingualism and Assessment: Achieving transparency, assuring quality, sustaining diversity
ISBN: 978 0 521 71192 0

Language Testing Matters: Investigating the wider social and educational impact of assessment
ISBN: 978 0 521 16391 0
European language testing
in a global context
Proceedings of the ALTE Barcelona Conference
July 2001
Also in this series:
An investigation into the comparability of two tests of English as a Foreign Language:
The Cambridge-TOEFL comparability study
Lyle F. Bachman, F. Davidson, K. Ryan, I.-C. Choi
Test taker characteristics and performance: A structural modeling approach
Antony John Kunnan
Performance testing, cognition and assessment: Selected papers from the 15th Language
Testing Research Colloquium, Cambridge and Arnhem
Michael Milanovic, Nick Saville
The development of IELTS: A study of the effect of background knowledge on reading
comprehension
Caroline Margaret Clapham
Verbal protocol analysis in language testing research: A handbook
Alison Green
A multilingual glossary of language testing terms
Prepared by ALTE members
Dictionary of language testing
Alan Davies, Annie Brown, Cathie Elder, Kathryn Hill, Tom Lumley, Tim McNamara
Learner strategy use and performance on language tests: A structural equation
modelling approach
James Enos Purpura
Fairness and validation in language assessment: Selected papers from the 19th Language
Testing Research Colloquium, Orlando, Florida
Antony John Kunnan
Issues in computer-adaptive testing of reading proficiency
Micheline Chalhoub-Deville
Experimenting with uncertainty: Essays in honour of Alan Davies
A. Brown, C. Elder, N. Iwashita, E. Grove, K. Hill, T. Lumley, K. O'Loughlin, T. McNamara
An empirical investigation of the componentiality of L2 reading in English for academic
purposes
Cyril Weir
The equivalence of direct and semi-direct speaking tests
Kieran O'Loughlin
A qualitative approach to the validation of oral language tests
Anne Lazaraton
Continuity and Innovation: Revising the Cambridge Proficiency in English Examination 1913–2002
Edited by Cyril Weir and Michael Milanovic
The development of CELS: A modular approach to testing English Language Skills
Roger Hawkey
Testing the Spoken English of Young Norwegians: a study of testing validity and the role
of smallwords in contributing to pupils' fluency
Angela Hasselgren
Changing language teaching through language testing: A washback study
Liying Cheng
European language testing
in a global context
Proceedings of the ALTE Barcelona Conference
July 2001
PUBLISHED BY THE PRESS SYNDICATE OF THE UNIVERSITY OF CAMBRIDGE
The Pitt Building, Trumpington Street, Cambridge CB2 1RP, UK
CAMBRIDGE UNIVERSITY PRESS
The Edinburgh Building, Cambridge CB2 2RU, UK
40 West 20th Street, New York, NY 10011-4211, USA
477 Williamstown Road, Port Melbourne, VIC 3207, Australia
Ruiz de Alarcón 13, 28014 Madrid, Spain
Dock House, The Waterfront, Cape Town 8001, South Africa
http://www.cambridge.org
© UCLES 2004
A catalogue record for this book is available from the British Library
Section One
Issues in Language Testing
Section Two
Research Studies
Section Three
A European View
Section Four
Work in Progress
Series Editors' note
The conference papers presented in this volume represent a small subset of the
many excellent presentations made at the ALTE conference 'European Language Testing in a Global Context', held in July 2001 in Barcelona in
celebration of the European Year of Languages 2001. They have been
selected to provide a flavour of the issues that the conference addressed. A
full listing of all presentations is attached at the end of this note.
The volume is divided into four parts. The first, with two papers, one
written by Charles Alderson and the other by Antony Kunnan, has a focus on
more general issues in Language Testing.
Alderson looks at some key issues in the field; he considers 'the shape of things to come' and asks if it will be 'the normal distribution'. Using this pun
to structure his paper, he focuses on two aspects of language testing; the first
relates to the technical aspects of the subject (issues of validity, reliability,
impact etc.), the second relates to ethical and political concerns.
Most of his paper chimes well with current thinking on the technical
aspects and, as he admits, much of what he presents is not new and is
uncontroversial. Within the European context he refers to the influential work
of the Council of Europe, especially the Common European Framework and
the European Language Portfolio; he describes a number of other European
projects, such as DIALANG and the national examination reform project in
Hungary, and he praises various aspects of the work of ALTE (e.g. for its
Code of Practice, for organising useful conferences, for encouraging
exchange of expertise among its members, and for raising the profile of
language testing in Europe).
In focusing on the political dimension, however, he positions himself as
devil's advocate and sets out to be provocative, perhaps deliberately introducing a 'negative skew' into his discussion. As always, his contribution
is stimulating and his conclusions are certainly controversial, particularly his
criticism of ALTE and several other organisations. These conclusions would
not go unchallenged by many ALTE members, not least because he
misrepresents the nature of the association and how it operates.
Kunnan's paper discusses the qualities of test fairness and reflects his
longstanding concerns with the issues involved in this area. The framework
he presents is of great value to the field of Language Testing and Kunnan has
contributed significantly to the on-going debate on the qualities of test
fairness within ALTE.
The second part of the volume presents a number of research studies. Anne
Lazaraton focuses on the use of qualitative research methods in the
development and validation of language tests. Lazaraton is a pioneer of
qualitative research in language testing and her involvement dates back to the
late eighties and early nineties when such approaches were not yet widely
used in the field. It is in part due to her efforts that researchers are now more
willing to embrace approaches that can provide access to the rich and deep
data of qualitative research. Readers are encouraged to look at her volume in
this series (A Qualitative Approach to the Validation of Oral Language Tests).
Vivien Berry and Jo Lewkowicz focus on the important issue of
compulsory language assessment for graduating students in Hong Kong.
Their paper considers alternatives to using a language test alone for this
purpose and looks at the applicability of variations on the portfolio concept.
Jim Purpura's work on the validation of questionnaires, which addresses the
interaction of personal factors and second language test performance,
represents an interesting and challenging dimension of validation in language
testing. Readers may also wish to refer to Purpura's volume in this series
(Learner strategy use and performance on language tests: A structural
equation modelling approach), which looks in more depth at the development
of questionnaires to determine personal factors and a methodology that can be
used to investigate their interactions with test performance.
Annie Brown's paper is particularly relevant as we move towards greater
use of computers in language testing. Such a move is of course fraught with
issues, not least of which is the one of legibility that Brown addresses here.
Her findings are interesting, giving us pause for thought and indicating, as she
suggests, that more research is required. In the context of IELTS, such
research is currently being conducted in Cambridge.
Barry O'Sullivan's paper attempts to model the factors affecting oral test
performance, an area of particular significance in large-scale assessment. The
paper is part of on-going research commissioned by the University of
Cambridge Local Examinations Syndicate and it is hoped that a collection of
research studies into the dimensions of oral assessment will be published in
this series in due course.
Finally, Sari Luoma's paper looks at self-assessment in the context of DIALANG. The DIALANG project, also referred to in Alderson's paper, has
been one of the key initiatives of the European Commission in relation to
language testing. As such it has benefited from significant funding and
generated much research potential.
The last two parts of the volume cover aspects of work in progress. On the
one hand, Joe Sheils and Wolfgang Mackiewicz summarise aspects of the on-
going work of the Council of Europe and the European Union in relation to
language policy. On the other, a number of researchers bring us up-to-date
with test development work largely, though not exclusively, in the context of
ALTE. These papers provide the reader with a reasonable overview of what
is going on in a number of European countries.
In the context of the conference reflected in this volume, it is appropriate
to overview how ALTE has developed over the years and what is of particular
concern to the members of ALTE at the moment.
ALTE has been operating for nearly a decade and a half. It was first formed
when a few organisations, acknowledging the fact that there was no obvious
forum for the discussion of issues in the assessment of one's own language as
a foreign language in the European context, decided to meet with this aim in
mind. The question of language assessment generally is an enormous one and
dealt with in different ways by national and regional authorities throughout
Europe and the world. Trying to bring together such a large and diverse
community would have been a very significant task and far beyond the scope
of ALTE's mission. ALTE's direct interests and aims are on a much smaller scale, and it is important to underline that it seeks to bring together those
interested in the assessment of their own language as a foreign language. This
is often in an international context, particularly with the more widely spoken
languages, but also in a national context, as is the case with lesser-spoken
languages in particular. While some ALTE members are located within
ministries or government departments, others are within universities and
cultural agencies. The members of ALTE are part of the international
educational context and ALTE itself, as well as the members that form it, is a
not-for-profit organisation. As a group, ALTE aims to provide a benchmark
of quality in the particular domain in which it operates. Should ALTE's work
be of relevance outside its own context, then so much the better, but ALTE
does not set out to establish or police the standard for European language
assessment in general.
The recent history of language testing in the European context is very
mixed. In the case of English we are fortunate that there has been significant
interest and research in this field in English speaking countries for many
years. In relation to some other European languages this is not the case. ALTE
recognises that the field of language testing in different languages will be at
different stages of development and that developing a language testing
capacity in the European context, albeit in a relatively narrow domain, is an
on-going venture. Similarly, progress, in contexts where participants are free
to walk away at any time, cannot be achieved through force or coercion but
rather through involvement, greater understanding and personal commitment.
ALTE operates as a capacity builder in the European context, albeit in a
relatively narrow domain.
As with any association, ALTE has a Secretariat, based in Cambridge and
elected by the membership. The Secretariat has a three-year term of office and
is supported by a number of committees, made up from the membership, who
oversee various aspects of ALTE's work. The group is too large for all
members to be involved in everything and there are a number of sub-groups,
organised by the members and focusing on particular areas of interest. The
sub-groups are formed, reformed and disbanded as circumstances and
interests dictate, and at the moment there are several active ones. We will
briefly describe the work of some of these here.
The whole of ALTE has been working for some time on the ALTE Framework, which seeks to place the examinations of ALTE members onto a common framework, related closely, through empirical study, to the Common European Framework. The process of placing examinations on the framework
is underpinned by extensive work on the content analysis of examinations,
guidelines for the quality production of examinations and empirically
validated performance indicators in many European languages. This work has
been supported by grants from the European Commission for many years and
is now being taken forward by a number of sub-groups which are considering
different domains of use such as language specifically for work purposes, for
young learners or for study through the medium of a language.
A group has been established to look at the extent to which teacher
qualifications in different languages can be harmonised and placed on some
kind of framework. The group is not looking specifically at state-organised qualifications but rather those common in the private sector, for example those offered by the Alliance Française, the Goethe Institute, the Cervantes Institute or Cambridge, amongst others. It seeks to provide greater flexibility
and mobility for the ever-growing body of language teachers, often qualified in one language and wishing to teach another, while having their existing qualifications recognised as contributing to future ones in a more systematic way than is possible at present.
The Council of Europe has made and continues to make a substantial
contribution to the teaching, learning and assessment of languages in the
European context and in recent years has developed the concept of the
European Language Portfolio as an aid and support to the language learning
and teaching community. ALTE and the European Association for Quality
Language Services have collaborated on the development of a portfolio for
adults, which is now in the public domain. It is hoped that this will be a
valuable aid to adult learners of languages in the European context.
An ALTE sub-group has been working with the Council of Europe and
John Trim in the elaboration of a Breakthrough level which would
complement the Waystage, Threshold and Vantage levels already developed.
ALTEs work in this area has also been supported by the European
Commission in the form of funding to a group of members from Finland,
Ireland, Norway, Greece and Sweden who have a particular interest in
language teaching and testing at the Breakthrough level.
Another ALTE sub-group has been working on the development of a multilingual system of computer-based assessment. The approach, which is based on the concept of computer-adaptive testing, has proved highly successful and innovative, providing assessment in several European languages, and won the European Academic Software Award in 2000.
ALTE members have developed a multilingual glossary of language testing terms. Part of this work has been published in this series (A multilingual glossary of language testing terms), but the work is ongoing, and as new languages join ALTE, further versions of the glossary are being developed.
The glossary has allowed language testers in about 20 countries to define
language testing terms in their own language and thus contributes to the
process of establishing language testing as a discipline in its own right. The
European Commission has supported this work throughout.
In the early 1990s, ALTE developed a code of professional practice and
work has continued to elaborate the concept of quality assurance in language
testing through the development of quality assurance and quality management
instruments for use initially by ALTE members. This work has been in
progress for several years and is now in the hands of an ALTE sub-group. As
noted above, developing the concept of quality assurance and its management has to be a collaborative venture between partners and cannot simply be imposed in the ALTE context. ALTE members are aware that they carry
significant responsibility and aim to continue to play a leading role in defining
the dimensions of quality and how an effective approach to quality
management can be implemented. This work is documented and has been
elaborated in ALTE News as well as at a number of international conferences.
Details are also available on the ALTE website: www.alte.org.
Members of ALTE are also concerned to measure the impact of their
examinations and work has gone on in the context of ALTE to develop a
range of instrumentation to measure impact on stakeholders in the test taking
and using constituency. Roger Hawkey discusses the concept of impact in the context of the Lingua 2000 project in one of the papers in this volume (see contents page).
ALTE members meet twice a year and hold a language testing conference
in each meeting location. This is an open event, details of which are available
on the ALTE website. New ALTE members are elected by the membership as
a whole. Members are either 'full' (from countries in the European Union) or 'associate' (from countries outside). For organisations which do not have the resources to be full or associate members, or which operate in a related field,
there is the option of observer status. Information on all of these categories of
membership is available on the ALTE website.
Finally, following the success of the Barcelona conference, ALTE has
agreed to organise another international conference in 2005. Details are
available on the website.
Mike Milanovic
Cyril Weir
March 2003
Presentations at ALTE Conference Barcelona, 2001

Margaretha Corell and Thomas Wrigstad
Stockholm University, Department of Scandinavian Languages and Centre for Research on Bilingualism
What's the difference? Analysis of two paired conversations in the Oral examination of the National Tests of Swedish as a Second/Foreign Language

Ben Csapó and Marianne Nikolov
University of Szeged and University of Pécs, Hungary
Hungarian students' performances on English and German tests

John H.A.L. de Jong
Language Testing Services, The Netherlands
Procedures for Relating Test Scores to Council of Europe Framework

Ina Ferbezar and Marko Stabej
University of Ljubljana, Centre for Slovene as a Second/Foreign Language and Department of Slavic Languages and Literature, Slovenia
Developing and Implementing Language Tests in Slovenia

Jesús Fernández and Clara Maria de Vega Santos
Universidad de Salamanca, Spain
Advantages and disadvantages of the Vantage Level: the Spanish version (presentation in Spanish)

Neus Figueras
Generalitat de Catalunya, Department d'Ensenyament, Spain
Bringing together teaching and testing for certification. The experience at the Escoles Oficials d'Idiomes

Anne Gutch
UCLES, UK
A major international exam: The revised CPE

Sue Kanburoglu Hackett and Jim Ferguson
The Advisory Council for English Language Schools Ltd, Ireland
Interaction in context: a framework for assessing learner competence in action

H.I. Hacquebord and S.J. Andringa
University of Groningen, Applied Linguistics, The Netherlands
Testing text comprehension electronically

Roger Hawkey
c/o UCLES, UK
Progetto Lingue 2000: Impact for Language Friendly Schools

Nathalie Hirschprung
Alliance Française, France
Teacher certifications produced by the Alliance Française in the ALTE context

Maria Iakovou
University of Athens, School of Philosophy, Greece
The teaching of Greek as a foreign language: Reality and perspectives

Miroslaw Jelonkiewicz
Warsaw University and University of Wroclaw, Poland
Describing and Testing Competence in Polish Culture

Miroslaw Jelonkiewicz
Warsaw University, Poland
Describing and gauging competence in Polish culture

Neil Jones
UCLES, UK
Using ALTE Can-Do statements to equate computer-based tests across languages

Neil Jones, Henk Kuijper and Angela Verschoor
UCLES, UK and Citogroep, The Netherlands
Relationships between paper and pencil tests and computer based testing

Lucy Katona
Idegennyelvi Továbbképző Központ (ITK), Hungary
The development of a communicative oral rating scale in Hungary

Antony John Kunnan
Assoc. Prof., TESOL Program, USA
Articulating a fairness model

Rita Kursite
Jaunjelgava Secondary School, Latvia
Analyses of Listening Tasks from Different Points of View

Michel Laurier and Denise Lussier
University of Montreal, Faculty of Education, and McGill University
The development of French language tests based on national benchmarks

Anne Lazaraton
University of Minnesota, USA
Setting standards for qualitative research in language testing

Jo Lewkowicz
The University of Hong Kong, The English Centre, Hong Kong
Stakeholder perceptions of the text in reading comprehension tests

Sari Luoma
University of Jyväskylä
Self-assessment in DIALANG

Denise Lussier
McGill University, Canada
Conceptual Framework in Teaching and Assessing Cultural Competence

Wolfgang Mackiewicz
Freie Universität Berlin, Germany
Higher education and language policy in the European Union

Waldemar Martyniuk
Jagiellonian University, Poland
Polish for Europe: Introducing Certificates in Polish as a Foreign Language

Lydia McDermott
University of Natal, Durban
Language testing, contextualised needs and lifelong learning

Debie Mirtle
MATESOL, Englishtown, Boston, USA
Online language testing: Challenges, Successes and Lessons Learned

Lelia Murtagh
ITÉ
Assessing Irish skills and attitudes among young adult secondary school leavers

Marie J. Myers
Queen's University, Canada
Entrance assessments in teacher training: a lesson of international scope

Barry O'Sullivan
The University of Reading, School of Linguistics and Applied Language Studies, UK
Modelling Factors Affecting Oral Language Test Performance: An empirical study

Silvia María Olalde Vegas and Olga Juan Lázaro
Instituto Cervantes, España
Spanish distance-learning courses: Follow-up and evaluation system

Christine Pegg
Cardiff University, Centre for Language and Communication Research, UK
Lexical resource in oral interviews: Equal assessment in English and Spanish?

Mónica Perea & Lluís Ràfols
Generalitat de Catalunya
The new Catalan examination system and the examiners' training

Juan Miguel Prieto Hernández
Universidad de Salamanca, Spain
Problemas para elaborar y evaluar una prueba de nivel: Los Diplomas de Español como Lengua Extranjera

James E. Purpura
Columbia Teachers College, USA
Developing a computerised system for investigating non-linguistic factors in L2 learning and test performances

John Read
Victoria University of Wellington, New Zealand
Investigating the Impact of a High-stakes International Proficiency Test

Diana Rumpite
Riga Technical University, Latvia
Innovative tendencies in computer based testing in ESP

Raffaele Sanzo
Ministero della Pubblica Istruzione, Italy
Foreign languages within the frame of Italian educational reform

Joseph Sheils
Modern Languages Division, Council of Europe
Council of Europe language policy and the promotion of plurilingualism

Elana Shohamy
Tel Aviv University, School of Education, Israel
The role of language testing policies in promoting or rejecting diversity in multilingual/multicultural societies

Kari Smith
Oranim Academic College of Education, Israel
Quality assessment of Quality Learning: The digital portfolio in elementary school

M. Dolors Solé Vilanova
Generalitat de Catalunya, Centre de Recursos de Llengües Estrangeres of the Department of Education
The effectiveness of the teaching of English in the Baccalaureate school population in Catalonia. Where do we stand? Where do we want to be?

Lynda Taylor
UCLES, UK
Revising instruments for rating speaking: combining qualitative and quantitative insights

John Trim
Project Director for Modern Languages, Council of Europe
The Common European Framework of Reference for Languages and its implications for language testing

Philippe Vangeneugden and Frans van der Slik
Katholieke Universiteit Nijmegen, The Netherlands, and Katholieke Universiteit Leuven, Belgium
Towards a profile related certification structure for Dutch as a foreign language. Implications of a needs analysis for profile selection and description

Poster Presentations

Guy Bentner & Ines Quaring
Centre de Langues de Luxembourg
Tests of Luxembourgish as a foreign language

José Pascoal
Universidade de Lisboa, Portugal
Tests of Portuguese as a foreign language

Heinrich Rübeling
WBT, Germany
Test Arbeitsplatz Deutsch: a workplace-related language test in German as a foreign language

László Szabó
Eötvös Loránd University, Budapest, Centre for Foreign Languages, Hungary
Hungarian as a foreign language examination
List of contributors
Section 1
Issues in Language Testing
The shape of things to come: will it be the normal distribution?
Charles Alderson
Department of Linguistics and Modern English Language
Lancaster University
Introduction
In this paper I shall survey developments in language testing over the past
decade, paying particular attention to new concerns and interests. I shall
somewhat rashly venture some predictions about developments in the field
over the next decade or so and explore the shape of things to come.
Many people see testing as technical and obsessed with arcane procedures
and obscure discussions about analytic methods expressed in 'alphabet soup', such as IRT, MDS, SEM and DIF. Such discourses and obsessions are alien to teachers, and to many other researchers. In fact these concepts are not irrelevant, because many of them are important factors in an understanding of our constructs: what we are trying to test. The problem is that they are often
poorly presented: researchers talking to researchers, without being sensitive to
other audiences who are perhaps less obsessed with technical matters.
However, I believe that recent developments have seen improved
communication between testing specialists and those more generally
concerned with language education, which has resulted in a better understanding of how testing connects to people's lives.
Much of what follows is not necessarily new, in the sense that the issues
have indeed been discussed before, but the difference is that they are now
being addressed in a more critical light, with more questioning of assumptions
and by undertaking more and better empirical research.
practice, and since tests often have a prescriptive or normative role, then their
social consequences are potentially far-reaching. In the light of such impact,
he proposes the need for a professional morality among language testers, both
to protect the profession's members and to protect the individual within
society from misuse and abuse of testing instruments. However, he also argues
that the morality argument should not be taken too far, lest it lead to
professional paralysis, or cynical manipulation of codes of practice.
A number of case studies illustrate the use and misuse of language tests.
Two examples from Australia (Hawthorne 1997) are the use of the 'access' test to regulate the flow of migrants into Australia, and the 'step' test, allegedly designed to play a central role in determining asylum seekers' residential status. Similar misuses of the IELTS test to regulate immigration
into New Zealand are also discussed in language testing circles but not yet
published in the literature. Perhaps the new concern for ethical conduct will
result in more whistle-blowing accounts of such misuse. If not, it is likely to
remain so much hot air.
Nevertheless, an important question is: to what extent are testers
responsible for the consequences, use and misuse of their instruments? To
what extent can test design prevent misuse? The ALTE Code of Practice is
interesting, in that it includes a brief discussion of test developers' responsibility to help users to interpret test results correctly, by providing
reports of results that describe candidate performance clearly and accurately,
and by describing the procedures used to establish pass marks and/or grades.
If no pass mark is set, ALTE members are advised to provide information that
will help users set pass marks when appropriate, and they should warn users
to avoid anticipated misuses of test results.
Despite this laudable advice, the notion of consequential validity is in my
view highly problematic because, as washback research has clearly shown,
there are many factors that affect the impact a test will have, and how it will
be used, misused and abused. Not many of these can be attributed to the test,
or to test developers, and we need to demarcate responsibility in these areas.
But, of course, the point is well taken that testers should be aware of the
consequences of their tests, and should ensure that they at least behave
ethically. Part of ethical behaviour, I believe, is indeed investigating, not just
asserting, the impact of the tests we develop.
Politics
Clearly, tests can be powerful instruments of educational policy, and are
frequently so used. Thus testing can be seen, and increasingly is being seen,
as a political activity, and new developments in the field include the relation
between testing and politics, and the politics of testing (Shohamy 2001).
But this need not be only at the macro-political level of national or local
4
1 The shape of things to come: will it be the normal distribution?
5
1 The shape of things to come: will it be the normal distribution?
National tests
One of the reasons we will hear a great deal about the Common European
Framework in the future is because of the increasing need for mutual
recognition and transparency of certificates in Europe, for reasons of
educational and employment mobility. National language qualifications, be
they provided by the state or by quasi-private organisations, vary enormously
in their standards: both quality standards and standards as levels.
International comparability of certificates has become an economic as well as
an educational imperative, and the availability of a transparent, independent
having learned something about their first performance and thus is closer to
current ability.
Computers can also be user-friendly in offering a range of support to test
takers: on-line Help facilities, clues, tailor-made dictionaries and more, and
the use of such support can be monitored and taken into account in calculating
test scores. Users can be asked how confident they are that the answer they
have given is correct, and their confidence rating can be used to adjust the test
score. Self-assessment and the comparison of self-assessment with test
performance is an obvious extension of this principle of asking users to give
insights into their ability. Similarly, adaptive tests need not be merely
psychometrically driven, but the user could be given the choice of taking
easier or more difficult items, especially in a context where the user is given
immediate feedback on their performance. Learners can be allowed to choose
which skill they wish to be tested on, or which level of difficulty they take a
test at. They can be allowed to choose which language they wish to see test
rubrics and examples in, and in which language results and feedback are to be
presented.
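To make this concrete, here is a minimal sketch, in Python, of how a confidence rating might be folded into an item score. The scheme is invented for illustration: the function name, the 1 to 5 confidence scale and the penalty weighting are all assumptions, not a description of any operational test.

    # Hypothetical scoring rule: an item score adjusted by the test
    # taker's self-reported confidence (1 = guessing, 5 = certain).
    # The weighting scheme is invented for this example.

    def adjusted_item_score(correct: bool, confidence: int) -> float:
        """Reward confident correct answers; penalise confident errors."""
        if not 1 <= confidence <= 5:
            raise ValueError("confidence must be between 1 and 5")
        weight = confidence / 5.0          # scale confidence to 0.2-1.0
        return weight if correct else -0.5 * weight

    print(adjusted_item_score(True, 5))    # 1.0: right and sure
    print(adjusted_item_score(True, 1))    # 0.2: right but guessing
    print(adjusted_item_score(False, 5))   # -0.5: confidently wrong

Under such a rule, two test takers with identical right/wrong patterns can receive different scores, which is precisely what eliciting confidence ratings is meant to achieve.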
An example of computer-based diagnostic tests, available over the Internet,
which capitalises on the advantages I have mentioned, is DIALANG (see
Chapter 8 by Sari Luoma, page 143). DIALANG uses self-assessment as an
integral part of diagnosis, asking users to rate their own ability. These ratings
are used in combination with objective techniques in order to decide which
level of test to deliver to the user. DIALANG provides immediate feedback to
users, not only on scores, but also on the relationship between their test results
and their self-assessment. DIALANG also gives extensive explanatory and
advisory feedback on test results. The language of administration, of self-
assessment, and of feedback, is chosen by the test user from a list of 14
European languages, and users can decide which skill they wish to be tested
in, in any one of 14 European languages.
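The routing idea, combining self-assessment with an objective measure to choose a test level, can be pictured schematically. Everything numerical in the sketch below (the weights, the cut-offs, the three-level split, the 0 to 100 screening score) is invented for illustration and is not DIALANG's actual algorithm; only the general principle, that a self-assessment band and a short objective measure jointly determine the level of the test delivered, reflects the description above.

    # Schematic placement routing: a self-assessment band and a short
    # objective screening score jointly select the level of the main test.
    # All weights and cut-offs are invented for this sketch.

    def choose_test_level(self_assessment: int, screen_score: int) -> str:
        """self_assessment: band 1-6; screen_score: 0-100."""
        combined = 0.4 * (self_assessment / 6) + 0.6 * (screen_score / 100)
        if combined < 0.35:
            return "easy"
        elif combined < 0.70:
            return "intermediate"
        return "difficult"

    print(choose_test_level(2, 30))   # easy
    print(choose_test_level(4, 65))   # intermediate
    print(choose_test_level(6, 90))   # difficult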
One of the claimed advantages of computer-based assessment is that
computers can store enormous amounts of data, including every keystroke
made by candidates and their sequence and the time taken to respond to a task,
as well as the correctness of the response, the use of help, clue and dictionary
facilities, and much more. The challenge is to make sense of this mass of data.
A research agenda is needed.
What is needed above all is research that will reveal more about the validity
of the tests, that will enable us to estimate the effects of the test method and
delivery medium; research that will provide insights into the processes and
strategies test takers use; studies that will enable the exploration of the
constructs that are being measured, or that might be measured. Alongside
development work that explores how the potential of the medium might best
be harnessed in test methods, support, diagnosis and feedback, we need
research that investigates the nature of the most effective and meaningful
addressed. My own wish list for the future of language testing would include
more accounts by developers (along the lines of Alderson, Nagy and Öveges 2000) of how tests were developed, and of how constructs were identified, operationalised, tested and revised. Such accounts could contribute to the applied linguistic literature by helping us understand these constructs and the issues involved in operationalisation, in validating, if you like, the theory.
Pandora's boxes
Despite what I have said about the Bachman Model, McNamara has opened
what he calls 'Pandora's box' (McNamara 1995). He claims that the problem
with the Bachman Model is that it lacks any sense of the social dimension of
language proficiency. He argues that it is based on psychological rather than
socio-psychological or social theories of language use, and he concludes that
we must acknowledge the intrinsically social nature of performance and
examine much more carefully its interactional (i.e. social) aspects. He asks
the disturbing question: whose performance are we assessing? Is it that of the
candidate? Or the partner in paired orals? Or the interlocutor in one-to-one
tests? The designer who created the tasks? Or the rater (and the creator of the
criteria used by raters)? Given that scores are what is used in reporting results,
then a better understanding of how scores are arrived at is crucial. Research
has intensified into the nature of the interaction in oral tests and I can
confidently predict that this will continue to be a fruitful area for research,
particularly with reference to performance tests.
Performance testing is not in itself a new concern, but is a development
from older concerns with the testing of speaking. Only recently, however,
have critiques of interviews made their mark. It has been shown through
discourse analysis that the interview is only one of many possible genres of
oral task, and it has become clear that the language elicited by interviews is
not the same as that elicited by other types of task, and by different sorts of
social interaction which do not have the asymmetrical power relations of the
formal interview. Thus different constructs may be tapped by different tasks.
Hill and Parry (1992) claim that traditional tests of reading assume that
texts have meaning, and view text, reader and the skill of reading itself as
autonomous entities. In contrast, their own view of literacy is that it is socially
constructed, and they see the skill of reading as being much more than
decoding meaning. Rather, reading is the socially structured negotiation of
meaning, where readers are seen as having social, not just individual,
identities. Hill and Parry's claim is that this view of literacy requires an
alternative approach to the assessment of literacy that includes its social
dimension. One obvious implication of this is that what it means to understand
a text will need to be revisited. In the post-modern world, where multiple
scores, and we know that different test methods will produce different
measures of comprehension of the same text. This is why we advocate testing
reading comprehension using multiple texts and multiple test methods. In
other words, we do not expect high item correlations, and arguably a low
Cronbach alpha would be a measure of the validity of our test and a high
reliability coefficient would suggest that we had not incorporated items that
were sufficiently heterogeneous.
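That argument can be checked numerically. The sketch below computes Cronbach's alpha by its standard formula for two invented score matrices: one in which items rise and fall together, and one in which right/wrong patterns vary item by item, as they might across deliberately varied texts and methods. The data are fabricated purely to show the direction of the effect.

    # Cronbach's alpha (standard formula) on two invented data sets:
    # alpha = k/(k-1) * (1 - sum(item variances) / variance(total scores))

    def cronbach_alpha(scores):
        """scores: one row per test taker, one column per item."""
        k = len(scores[0])                     # number of items
        def var(xs):                           # sample variance
            m = sum(xs) / len(xs)
            return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)
        item_vars = [var([row[i] for row in scores]) for i in range(k)]
        total_var = var([sum(row) for row in scores])
        return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

    # Homogeneous items: stronger candidates do better on everything.
    homogeneous = [[1, 1, 1, 1], [1, 1, 1, 0], [0, 1, 0, 0], [0, 0, 0, 0]]
    # Heterogeneous items: right/wrong patterns vary item by item.
    heterogeneous = [[1, 1, 1, 1], [1, 0, 0, 1], [0, 1, 0, 0], [0, 0, 1, 0]]

    print(round(cronbach_alpha(homogeneous), 2))    # about 0.87
    print(round(cronbach_alpha(heterogeneous), 2))  # about 0.44

On the view sketched above, the lower figure need not mean a worse test; it may simply reflect a broader sample of the construct.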
A recent study (Swain 1993) addresses this issue from a second-language-acquisition perspective. Swain studies tests designed to measure the various aspects of communicative proficiency posited by the Canale-Swain/Bachman family of models. In order to validate the models, the researchers wished to conduct factor analyses, which require high reliabilities of the constituent tests. However, they found remarkably low reliabilities of component test scores: scores for politeness markers correlated at .06 in requests, .18 in offers and .16 in complaints. If all component scores were added together, to get a more composite measure of reliability, the correlation between the two complaints was .06, between the two requests .14 and between the two offers .18. Even when averages were computed for each student across all three speech acts and correlated with their replications (a form of split-half correlation), that coefficient was only .49. Swain comments: 'we succeeded in getting a rather low estimate of internal consistency by averaging again and again, in effect, by lengthening the test and making it more and more complex. The cost is that information on how learners' performance varies from task to task has been lost' (1993: 199).
Second-language acquisition research shows that variation in task
performance will be the norm, not the exception, and it may be systematic, not
random, affected by the complex interaction of various task characteristics.
Both testing and task research show that performance on supposedly similar
tasks varies. 'If variation in interlanguage is systematic, what does this imply about the appropriateness of a search for internal test consistency?' (op. cit. 204). Indeed one might wish to argue that a good test of second-language
proficiency must have a low internal consistency.
We are thus faced with a real problem in conceptualising reliability and
validity, and in knowing what statistical results to use as valid measures of test
quality, be that reliability, validity or other. Indeed Swain argues that we
would do well to search for 'meaningful quality criteria' for the inclusion of
test tasks rather than rely so heavily on a measure of internal consistency. She
cites Linn et al.'s suggestion (Linn, Baker and Dunbar 1991) that several such
criteria might be consequences, fairness, cognitive complexity, content quality
and content coverage. However, given the arguments I have put forward
earlier about the complexity of washback, the difficulty of being clear about
what cognitive operations are involved in responding to test tasks, the
difficulty that judges have in judging test content, and the possibility that
apparently unfair tests might be valid, we are clearly facing dilemmas in
in fact a cartel, because it does not allow more than one body to represent any
language at a particular level. There are occasional exceptions where tests of
the full range of language ability are represented by two bodies, but that is
irrelevant to the argument that ALTE is an exclusive club where only one
organisation can represent any language at a given level. That in effect means
that no other examining body can join ALTE to represent English, since
English is covered by UCLES. And the Secretariat of ALTE is in Cambridge.
Imagine the power of UCLES/Cambridge, then, in ALTE.
But there is a dilemma. Many ALTE members also produce tests in
languages other than their national language. CITOgroep, the Finnish National
Language Certificates, and the Hungarian State Foreign Languages
Examinations Board, to take just the three examples cited earlier, all produce
tests of French, German, English and more in addition to tests of their national
language. But they are officially not members of ALTE for those other
languages. The question then arises of the status and indeed quality of their
tests in those languages. Unscrupulous testing bodies could, if they wished,
give the impression to the outside world that being a member of ALTE
guarantees the quality not only of their national language exams, but also of
their foreign language exams: a potentially very misleading impression
indeed in the case of some ALTE members and associates.
What is the rationale for this exclusivity? Especially in an age where the
power of the native speaker is increasingly questioned, where the notion of the
native-speaker norm has long been abandoned in language testing, and where
many organisations engage quite legitimately and properly in the testing of
languages of which they are not native speakers? I suggest that the notion of
exclusive membership is outdated as a legitimate concept. Rather it is retained
in order to ensure that UCLES be the sole provider within ALTE of exams in
the most powerful language in the world at present. This is surely a
commercial advantage, and I suspect, from conversations I have had, that
many ALTE members are not happy with this state of affairs, and wish the
playing field to be levelled. But ALTE remains a cartel.
Consider further the ALTE Code of Practice. It is an excellent document
and ALTE is rightly proud of its work. But there is no enforcement
mechanism; there is no way in which ALTE monitors whether its members
actually adhere to the Code of Practice, and membership of ALTE is not
conditional on applicants having met the standards set out in the Code. And
even if they did, the quality control would presumably apply only to the tests
of the national language for which the member was responsible. Thus the very
existence of the ALTE Code of Practice is something of an illusion: the user
and the test taker might believe that the Code of Practice is in force and
guarantees the quality of the exams of ALTE members, but that is simply not
true. ALTE does not operate like EAQUALS (the European Association for
Quality Language Services, http://www.eaquals.org), which has a rigorous
inspection system that is applied to any applicant language school wishing to
join the organisation. EAQUALS can claim that its members have met quality
standards. ALTE cannot.
Note that I have no doubt that ALTE has achieved a great deal. It has
developed not only a Code of Practice but also a framework for test levels
which, although it has now been superseded by the Council of Europe's
Common European Framework, has contributed to raising international debate
about levels and standards. ALTE holds very useful conferences, encourages
exchange of expertise among its members, and has certainly raised the profile
of language testing in Europe. But the time has come for ALTE to question its
basic modus operandi, its conditions of membership and its role in the world
of professional language testing, and to revise itself.
ALTE urgently needs also to consider its impact: not only the impact of its own tests, which as I have already suggested ought to be researched by ALTE members, but also the impact of its very existence on societies and on non-members. In particular I am very concerned that ALTE is a powerful threat to
national or local examination authorities that cannot become members. This is
especially true of school-leaving examinations, which are typically developed
by governmental institutions, not testing companies. Although, as I have said,
many of these examinations are worthless, there are in many countries,
especially in East and Central Europe, serious attempts to reform national
school-leaving examinations. But there is evidence that ALTE members are
operating increasingly in competition with such national bodies.
In Italy, the Progetto Lingue 2000 (http://www.cambridge-efl.org/italia/
lingue2000/index.cfm) is experimenting with issuing certificates from external commercial exams to schoolchildren. This is bound to have an
impact on the availability of locally produced, non-commercial exams.
Attention ought surely to be given to enhancing the quality and currency of
Italian exams of foreign languages. In Hungary, an ALTE associate member
offers language certificates (of unknown quality) that are recognised by the
state as equivalent to local non-commercial exams. And I understand that
Poland is experimenting with a similar system: Cambridge exams will be
recognised for school-leaving and university entrance purposes. But such
exams are not free to the user (in Italy, they are free at present, as the
educational authorities pay Cambridge direct for the entrance fees for the
exams, but this is clearly unsustainable in the long term).
One ethical principle that is not mentioned by testing companies is surely
that successful completion of free public education should be certified by
examinations that are free to the test taker, are of high quality, and which have
currency in the labour market and in higher education. This principle is
undermined if expensive examinations are allowed to replace free local
examination certificates.
What ALTE as a responsible professional body ought to be doing is helping
to build local capacity to deliver quality examinations, regardless of whether
those exam providers are members of ALTE or not. If ALTE exams replace
Conclusion
To summarise first. We will continue to research washback and explore test
consequences, but we will not simply describe them; rather, we will try to
explain them. How does washback occur, why do teachers and students do
what they do, why are tests designed and exploited the way they are and why
are they abused and misused? We will burrow beneath the discourses to
understand the hidden agendas of all stakeholders. I would not put any money
on examination boards doing this, but I am willing to be surprised. We will
develop further our codes of practice and develop ways of monitoring their
implementation, not just with fine words but with action. We will develop
codes of ethical principles as well and expand our understanding of their
suitability in different cultural and political conditions.
The Common European Framework will grow in influence and as it is used
it will be refined and enhanced; there will be an exploration of finer-grained
sub-levels between the main six levels to capture better the nature of learning,
and through that we will develop a better understanding of the nature of the
development of language proficiency, hopefully accompanied by empirical
research.
Within language education, tests will hopefully be used more and more to
help us understand the nature of achievement and what goes on in classrooms
both the process and the product. The greater availability of computer-based
testing, and its reduced and even insignificant cost will enhance the quality of
class-based tests for placement, for progress and for diagnosis.
We will continue to research our construct. Pandora's boxes will remain
open but the Model will be subject to refinement and enhancement as the
nature of skills and abilities, of authenticity, and of the features of tasks are
explored and are related more to what we know from second-language
acquisition, just as second-language acquisition research, and especially its research methodology, will learn from testers.
We will continue in frustration to explore alternative assessment; the
European Language Portfolio will flourish and may even be researched. We
will explore an enhanced view of validity, with less stress on reliability, as we
focus more and more on the individual learner, not simply on groups of test-
takers, as we try to understand performances, not just scores. But this will only
happen if we do the research, if we learn from the past and build on old
concerns. Developing new fads and pseudo-solutions is counter-productive,
and ignoring what was written and explored ten years and more ago will not
be a productive way forward. We must accumulate understanding, not pretend
we have made major new discoveries.
What does all this have to do with my title?
My title is intended to focus on learning-related assessment and testing: on
diagnosis, placement, progress-testing and an examination not just of why we
have not made much progress in the area of achievement testing, pace the
developments in Central Europe, but also of what the implications of test
development analysis and research are for attention to low-stakes assessment
that is learning- and individual-learner-related, that need not (cannot?) meet
normal standards of reliability, that may not produce much variance, or that
occurs where we do not expect a normal curve.
What I intend my title to emphasise is that we will be less obsessed in the
future with normal distributions, with standard traditional statistics, or indeed
with new statistical techniques, and more concerned to understand, by a
variety of means, what it is that our tests assess, what effect they have and
what the various influences on and causes of test design, test use and test
misuse are. Through innovations in test research methodology, together with
the opportunities afforded by computer-based testing for much friendlier test
delivery, easier data handling and more fine-tuned assessment, we will get
closer to the individual and closer to understanding individual performances
and abilities. These will be interpreted in finer detail in relation to performance
on individual tasks, which will be understood better as consisting of
complexes of task characteristics, and not as an assembly of homogeneous
items. The complexities of tasks, of performances and of abilities will be
better appreciated and attempts will be made to understand this complexity.
The Common European Framework already offers a framework within which
concerted research can take place; computer-based testing can enhance test
delivery and the meaning of results; and the relationship between alternative
assessments like the European Language Portfolio and test-based performance
can be explored and better understood.
My title also has an ambiguity: 'normal distribution' of what? Not just of
test scores, but of power, of income and wealth, of access and opportunities,
of expertise and of responsibilities. A better distribution of all these is also
needed. We need more openness, more recognition of the quality of the work
of all, more concern to build the capacity of all, not just members of an
exclusive club. This, of course, will only happen if people want it to, if
research is encouraged by those with the power and the money, if there is less
self-interest, if there is greater co-operation and pooling of resources in a
common search for understanding, and less wasteful and harmful competition
and rivalry.
References
Alderson, J. C. 1999. What does PESTI have to do with us testers? Paper
presented at the International Language Education Conference, Hong
Kong.
Alderson, J. C. 2000. Technology in testing: the present and the future.
System.
Alderson, J. C., and C. Clapham. 1992. Applied linguistics and language
testing: a case study. Applied Linguistics 13: 2, 149–167.
Alderson, J. C., C. Clapham, and D. Wall. 1995. Language Test Construction
and Evaluation. Cambridge: Cambridge University Press.
Alderson, J. C., and L. Hamp-Lyons. 1996. TOEFL preparation courses: a
study of washback. Language Testing 13: 3, 280–297.
Alderson, J. C., E. Nagy and E. Öveges (eds.). 2000. English language education in Hungary, Part II: Examining Hungarian learners' achievements in English. Budapest: The British Council.
Alderson, J. C., and B. North. (eds.). 1991. Language Testing in the 1990s:
The Communicative Legacy. London: Modern English Publications in
association with The British Council.
Alderson, J. C., and D. Wall. 1993. Does washback exist? Applied Linguistics,
14: 2, 115–129.
ALTE 1998. ALTE handbook of European examinations and examination
systems. Cambridge: UCLES.
Bachman, L. F. 1990. Fundamental Considerations in Language Testing.
Oxford: Oxford University Press.
Bachman, L. F., and A. S. Palmer. 1996. Language Testing in Practice.
Oxford: Oxford University Press.
Davies, A. 1997. Demands of being professional in language testing. Language
Testing 14: 3, 328–339.
Hamp-Lyons, L. 1997. Washback, impact and validity: ethical concerns.
Language Testing 14: 3, 295–303.
Hawthorne, L. 1997. The political dimension of language testing in Australia.
Language Testing 14: 3, 248–260.
Hill, C., and K. Parry. 1992. The test at the gate: models of literacy in reading
assessment. TESOL Quarterly 26: 3, 433–461.
Lewkowicz, J. A. 1997. Investigating Authenticity in Language Testing.
Unpublished Ph.D., Lancaster University, Lancaster.
Linn, R. L., E. L. Baker and S. B. Dunbar. 1991. Complex, performance-based
assessment: expectations and validation criteria. Educational Researcher,
20 (November), 15–21.
Luoma, S. 2001. What Does Your Language Test Measure? Unpublished
Ph.D., University of Jyväskylä, Jyväskylä.
Test fairness
Antony John Kunnan
Abstract
The concept of test fairness is arguably the most critical in test evaluation but
there is no coherent framework that can be used for evaluating tests and testing
practice. In this paper, I present a Test Fairness framework that consists of the
following test qualities: validity, absence of bias, access, administration, and
social consequences. Prior to presenting this framework, I discuss early views
of test fairness, test evaluation in practice and ethics for language testing. I
conclude with some practical guidelines on how the framework could be
implemented and a discussion of the implications of the framework for test
development.
Introduction
The idea of test fairness as a concept that can be used in test evaluation has
become a primary concern of language-testing professionals today, but it is a
relatively recent preoccupation in the history of testing itself. Perhaps this is
because, on the egalitarian view, tests and examinations were considered
beneficial to society: they helped ensure equal opportunity for education and
employment and attacked the prior system of privilege and patronage. For this
reason, tests and examinations took on an air of infallibility. But everyone who
has taken a test knows that tests are not perfect; tests and testing practices need
to be evaluated too.
The first explicit, documented mention of a test quality was in the 19th
century, after competitive examinations had become entrenched in the UK.
According to Spolsky (1995), in 1858 a committee for the Oxford
examinations worked with the examiners to ensure 'the general consistency of
the examination as a whole' (p. 20). According to Stigler (1986), Edgeworth
articulated the notion of consistency (or reliability) in his papers on error and
chance much later, influenced by Galton's anthropometric laboratory for
studying physical characteristics. As testing became more popular in the later
decades of the 19th century and early 20th century, modern measurement
unified and expanded view of validity (see Henning 1987; Hughes 1989;
Alderson, Clapham and Wall 1995; Genesee and Upshur 1996; and Brown
1996). Only Bachman (1990) presented and discussed Messick's unified and
expanded view of validity. Thus, validity and reliability continue to remain the
dominant concepts in test evaluation, and fairness has remained outside the
mainstream.
test reviews, most of them uniformly discuss the five kinds of validity and
reliability (typically, in terms of test-retest and internal consistency), and a few
reviews discuss differential item functioning and bias.7 The 10th Volume of
Test Critiques (Keyser and Sweetland 1994) has 106 reviews which include
seven related to language. Although the reviews are longer and not as
constrained as the ones in the MMY, most reviews only discuss the five kinds
of validity and the two kinds of reliability. The Reviews of English Language
Proficiency Tests (Alderson et al. 1987) is the only compilation of reviews of
English language proficiency tests available. There are 47 reviews in all, and
they follow the MMY's set pattern of discussing only reliability and validity,
mostly using the 'trinitarian' approach to validity, while a few reviews also
include discussions of practicality. There is no direct reference to test fairness.
test or item bias for certain groups, but might not be able to answer questions
regarding other group differences. Or a single validation study (of, say,
internal structure), while useful in its own right, would have insufficient
validation evidence to claim that the test has all the desirable qualities. Third,
published test reviews are narrow and constrained, such that none of
the reviews I surveyed follows Messick's (1989) concepts of test interpretation
and use and the evidential and consequential bases of validation; therefore,
they do not provide any guidance regarding these matters. In short, based on
the analyses above, test evaluation is conducted narrowly and focuses mainly
on validity and reliability.
Test users should select tests that have been developed in ways that attempt
to make them as fair as possible for test takers of different races, gender,
ethnic backgrounds, or handicapping conditions.
Test users should:
Evaluate the procedures used by test developers to avoid potentially
insensitive content or language.
Review the performance of test takers of different races, gender and
ethnic backgrounds when samples of sufficient size are available.
Evaluate the extent to which performance differences might have been
caused by inappropriate characteristics of the test.
When necessary and feasible, use appropriately modified forms of tests
or administration procedures for test takers with handicapping
conditions. Interpret standard norms with care in the light of the
modifications that were made.
(Code 1988, pp. 4–5)
The Standards (1999) approach
In the recent Standards (1999), in the chapter entitled 'Fairness in testing and
test use', the authors state by way of background that 'the concern for fairness
in testing is pervasive, and the treatment accorded the topic here cannot do
justice to the complex issues involved. A full consideration of fairness would
explore the many functions of testing in relation to its many goals, including
the broad goal of achieving equality of opportunity in our society' (p. 73).
Furthermore, the document acknowledges the difficulty of defining fairness:
'the term fairness is used in many different ways and has no single meaning.
It is possible that two individuals may endorse fairness in testing as a desirable
social goal, yet reach quite different conclusions' (p. 74). With this caveat, the
authors outline four principal ways in which the term is used:15
The first two characterisations... relate fairness to absence of bias and to
equitable treatment of all examinees in the testing process. There is broad
consensus that tests should be free from bias... and that all examinees
should be treated fairly in the testing process itself (e.g. afforded the same
or comparable procedures in testing, test scoring, and use of scores). The
third characterisation of test fairness addresses the equality of testing
outcomes for examinee subgroups defined by race, ethnicity, gender,
disability, or other characteristics. The idea that fairness requires equality
in overall passing rates for different groups has been almost entirely
repudiated in the professional testing literature. A more widely accepted
view would hold that examinees of equal standing with respect to the
construct the test is intended to measure should on average earn the same
test score, irrespective of group membership... The fourth definition of
fairness relates to equity in opportunity to learn the material covered in an
learning and opportunity to learn, absence of bias in test content, language and
response patterns, and comparability in selection. It is these characteristics that
form the backbone of the framework that I propose below.
The Test Fairness framework
The Test Fairness framework views fairness in terms of the whole system of
a testing practice, not just the test itself. Therefore, following Willingham and
Cole (1997), it implicates multiple facets of fairness: multiple test uses (for
intended and unintended purposes), multiple stakeholders in the testing
process (test takers, test users, teachers and employers), and multiple steps in
the test development process (test design, development, administration and
use). Thus, the model has five main qualities: validity, absence of bias, access,
administration, and social consequences. Table 1 (see Appendix 1) presents
the model with the main qualities and the main focus of each. A brief
explanation of the qualities follows:
1 Validity: Validity of a test score interpretation can be used as part of the
test fairness framework when the following four types of evidence are
collected.
a) Content representativeness or coverage evidence: This type of evidence
(sometimes simply described as content validity) refers to the adequacy
with which the test items, tasks, topics and language dialect represent the
test domain.
b) Construct or theory-based validity evidence: This type of evidence
(sometimes described as construct validity) refers to the adequacy with
which the test items, tasks, topics and language dialect represent the
construct or theory or underlying trait that is measured in a test.
c) Criterion-related validity evidence: This type of evidence (sometimes
described as criterion validity) refers to how well the test scores under
consideration relate to criterion variables such as school or college grades,
on-the-job ratings, or some other relevant variable.
d) Reliability: This type of evidence refers to the reliability or consistency
of test scores: consistency of scores on different testing occasions
(described as stability evidence), between two or more different forms of
a test (alternate-form evidence), between two or more raters (inter-rater
evidence), and in the way test items measuring a construct function
together (internal consistency evidence). A small computational sketch of
these estimates follows.
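The correlational evidence under (c) and the consistency estimates under (d)
are simple to compute once scores are in hand. The sketch below (Python; the
score lists are invented for illustration, and the function names are mine, not
drawn from any study cited here) shows a Pearson correlation, which serves
equally for stability, alternate-form, inter-rater and criterion-related evidence,
and Cronbach's alpha, a common internal-consistency estimate.

import statistics

# Pearson correlation: usable for stability (test-retest), alternate-form,
# inter-rater and criterion-related evidence alike.
def pearson(x, y):
    mx, my = statistics.mean(x), statistics.mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x)
           * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

# Cronbach's alpha: internal-consistency evidence; 'items' holds one
# score list per item, aligned across test takers.
def cronbach_alpha(items):
    k = len(items)
    totals = [sum(person) for person in zip(*items)]
    item_var = sum(statistics.variance(i) for i in items)
    return k / (k - 1) * (1 - item_var / statistics.variance(totals))

# Invented data: ten test takers scored on two occasions, and a
# three-item dichotomously scored test taken by five people.
occasion1 = [12, 15, 9, 18, 14, 11, 16, 13, 17, 10]
occasion2 = [13, 14, 10, 17, 15, 10, 16, 12, 18, 11]
print(round(pearson(occasion1, occasion2), 2))  # stability estimate

items = [[1, 0, 1, 1, 0], [1, 1, 1, 0, 0], [0, 0, 1, 1, 0]]
print(round(cronbach_alpha(items), 2))          # internal consistency

The same pearson function, applied to test scores and, say, college grades,
would yield the criterion-related evidence described under (c).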
2 Absence of bias: Absence of bias in a test can be used as part of the test
fairness framework when evidence regarding the following is collected.
a) Offensive content or language: This type of bias refers to content that is
offensive to test takers from different backgrounds, such as stereotypes
of group members and overt or implied slurs or insults (based on gender,
37
2 Test fairness
race and ethnicity, religion, age, native language, national origin and
sexual orientation).
b) Unfair penalisation based on a test taker's background: This type of bias
refers to content that may cause unfair penalisation because of a test
takers group membership (such as that based on gender, race and
ethnicity, religion, age, native language, national origin and sexual
orientation).
c) Disparate impact and standard setting: This type of bias refers to
differing performances and resulting outcomes by test takers from
different group memberships. Such group differences (as defined by
salient test-taker characteristics such as gender, race and ethnicity,
religion, age, native language, national origin and sexual orientation) on
test tasks and sub-tests should be examined for Differential Item/Test
Functioning (DIF/DTF)16. In addition, a differential validity analysis
should be conducted in order to examine whether a test predicts success
better for one group than for another. In terms of standard setting, test
scores should be examined in relation to the criterion measure and
selection decisions. Test developers and users need to be confident that
the appropriate measure and statistically sound, unbiased selection
models are in use.17 These analyses should indicate to test developers
and test users that group differences are related to the abilities being
assessed and not to construct-irrelevant factors. A brief computational
sketch of such a screen follows.
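To make the screening step concrete, here is a minimal sketch (Python) of one
widely used DIF method, the Mantel-Haenszel procedure (one of the
approaches covered in the DIF literature cited in note 16), together with a
per-group test-criterion correlation of the kind a differential validity analysis
would inspect. The function and variable names are mine, the data shapes are
assumptions, and statistics.correlation requires Python 3.10 or later.

import math
import statistics  # statistics.correlation needs Python 3.10+
from collections import defaultdict

def mh_dif(item, total, group, focal="focal"):
    """Mantel-Haenszel DIF for one 0/1 item, stratified by total score."""
    cells = defaultdict(lambda: [0, 0, 0, 0])  # per stratum: [A, B, C, D]
    for r, t, g in zip(item, total, group):
        base = 2 if g == focal else 0          # reference in A/B, focal in C/D
        cells[t][base + (0 if r else 1)] += 1  # correct first, incorrect second
    num = den = 0.0
    for a, b, c, d in cells.values():
        n = a + b + c + d
        num += a * d / n                       # reference right, focal wrong
        den += b * c / n                       # reference wrong, focal right
    alpha = num / den                          # common odds ratio (assumes den > 0)
    delta = -2.35 * math.log(alpha)            # ETS delta metric
    return alpha, delta

def differential_validity(score, criterion, group):
    """Test-criterion correlation computed separately for each group."""
    return {g: statistics.correlation(
                [s for s, gg in zip(score, group) if gg == g],
                [c for c, gg in zip(criterion, group) if gg == g])
            for g in set(group)}

Under the usual ETS convention, absolute deltas below 1 are treated as
negligible DIF and values of roughly 1.5 or more as large, subject to
significance testing; markedly different per-group correlations from
differential_validity would suggest the differential prediction discussed above.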
3 Access: Access to a test can be used as part of the test fairness framework
when evidence regarding the following provisions is collected.
a) Educational access: This refers to whether a test is accessible to
test takers in terms of opportunity to learn the content and to become
familiar with the types of task and their cognitive demands.
b) Financial access: This refers to whether a test is affordable for test
takers.
c) Geographical access: This refers to whether a test site is accessible in
terms of distance to test takers.
d) Personal access: This refers to whether a test provides test takers
with certified physical and/or learning disabilities with appropriate
test accommodations. The 1999 Standards and the Code (1988) call for
accommodations such that test takers with special needs are not
denied access to tests, where these can be offered without compromising
the construct being measured.
e) Conditions or equipment access: This refers to whether test takers are
familiar with the test taking equipment (such as computers), procedures
(such as reading a map), and conditions (such as using planning time).
4 Administration: Administration of a test can be used as part of the test
fairness framework when evidence regarding the following conditions is
collected:
Conclusion
In conclusion, this paper argues for a test fairness framework in language
testing. This conceptualisation gives primacy to fairness: in my view, if a
test is not fair, there is little value in its having qualities such as validity and
reliability of test scores. The model therefore consists of five interrelated test
qualities: validity, absence of bias, access, administration, and social
consequences.
The notion of fairness advanced here is based on the work of the Code
(1988), the Standards (1999), and Willingham and Cole's (1997) notion of
'comparable validity'. This framework also brings to the forefront two
qualities (access and administration) that are ignored or suppressed in earlier
frameworks, as these qualities have not been seen as part of the responsibility
of test developers. They have generally been delegated to test administrators
and local test managers, but I propose that these two qualities should be
monitored in the developmental stages and not left to the test administrators.
This framework, then, is a response to current concerns about fairness in
testing and to recent discussions of applied ethics relevant to the field. Its
applicability in varied contexts for different tests and testing practices in many
countries would be a necessary test of its robustness. Further, I hope that the
framework can influence the development of shared operating principles
among language assessment professionals, so that fairness is considered vital
to the profession and so that societies benefit from tests and testing practices.
To sum up, as Rawls (1971) asserted, one of the principles of fairness is that
institutions or practices must be just. Echoing Rawls, then, there is no way to
develop tests and testing practice other than to make them such that, first and
foremost, there is fairness and justice for all. This is especially true in an age
of increasingly information-technology-based assessment, where the
challenge, in Barbour's (1993) words, is to imagine technology used in the
service of 'a more just, participatory, and sustainable society on planet earth'
(p. 267).
References
Alderman, D. and P. W. Holland. 1981. Item performance across native
language groups on the Test of English as a Foreign Language. Princeton:
Educational Testing Service.
Alderson, J. C. and A. Urquhart. 1985a. The effect of students' academic
discipline on their performance on ESP reading tests. Language Testing 2:
192–204.
Alderson, J. C. and A. Urquhart. 1985b. This test is unfair: I'm not an
economist. In P. Hauptman, R. LeBlanc and M.B. Wesche (eds.), Second
Language Performance Testing. Ottawa: University of Ottawa Press.
Alderson, J. C., K. Krahnke and C. Stansfield (eds.). 1987. Reviews of English
Language Proficiency Tests. Washington, DC: TESOL.
Alderson, J. C., C. Clapham and D. Wall. 1995. Language Test Construction
and Evaluation. Cambridge, UK: Cambridge University Press.
American Psychological Association. 1954. Technical Recommendations for
Psychological Tests and Diagnostic Techniques. Washington, DC: Author.
American Psychological Association. 1966. Standards for Educational and
Psychological Tests and Manuals. Washington, DC: Author.
American Psychological Association. 1974. Standards for Educational and
Psychological Tests. Washington, DC: Author.
American Psychological Association. 1985. Standards for Educational and
Psychological Testing. Washington, DC: Author.
American Psychological Association. 1999. Standards for Educational and
Psychological Testing. Washington, DC: Author.
Angoff, W. 1988. Validity: an evolving concept. In H. Wainer and H. Braun
(eds.), Test Validity (pp. 19–32). Hillsdale, NJ: Lawrence Erlbaum
Associates.
Bachman, L. 1990. Fundamental Considerations in Language Testing.
Oxford, UK: Oxford University Press.
Bachman, L., F. Davidson, K. Ryan and I-C. Choi. 1995. An Investigation into
the Comparability of Two Tests of English as a Foreign Language.
Cambridge, UK: Cambridge University Press.
Bachman, L. and A. Palmer. 1996. Language Testing in Practice. Oxford,
UK: Oxford University Press.
Barbour, I. 1993. Ethics in an Age of Technology. San Francisco, CA: Harper
Collins.
Baron, M., P. Pettit and M. Slote (eds.). 1997. Three Methods of Ethics.
Malden, MA: Blackwell.
Brown, A. 1993. The role of test-taker feedback in the test development
process: Test takers' reactions to a tape-mediated test of proficiency in
spoken Japanese. Language Testing 10: 3, 277–304.
Brown, J. D. 1996. Testing in Language Programs. Upper Saddle River, NJ:
Prentice-Hall Regents.
41
2 Test fairness
Camilli, G. and L. Shepard. 1994. Methods for Identifying Biased Test Items.
Thousand Oaks, CA: Sage.
Canale, M. 1988. The measurement of communicative competence. Annual
Review of Applied Linguistics 8: 67–84.
Chen, Z. and G. Henning. 1985. Linguistic and cultural bias in language
proficiency tests. Language Testing 2: 155–163.
Cizek, G. (ed.). 2001. Setting Performance Standards. Mahwah, NJ:
Lawrence Erlbaum Associates.
Clapham, C. 1996. The Development of IELTS. Cambridge, UK: Cambridge
University Press.
Clapham, C. 1998. The effect of language proficiency and background
knowledge on EAP students' reading comprehension. In A. J. Kunnan
(ed.), Validation in Language Assessment (pp. 141–168). Mahwah, NJ:
Lawrence Erlbaum Associates.
Code of Fair Testing Practices in Education. 1988. Washington, DC: Joint
Committee on Testing Practices. Author.
Corson, D. 1997. Critical realism: an emancipatory philosophy for applied
linguistics? Applied Linguistics, 18: 2, 166–188.
Crisp, R. and M. Slote (eds.). 1997. Virtue Ethics. Oxford, UK: Oxford
University Press.
Cumming, A. 1994. Does language assessment facilitate recent immigrants'
participation in Canadian society? TESL Canada Journal 2: 2, 117–133.
Davies, A. (ed.). 1968. Language Testing Symposium: A Psycholinguistic
Approach. Oxford, UK: Oxford University Press.
Davies, A. 1977. The Edinburgh Course in Applied Linguistics, Vol. 4.
London, UK: Oxford University Press.
Davies, A. (Guest ed.). 1997a. Ethics in language testing. Language Testing
14: 3.
Davies, A. 1997b. Demands of being professional in language testing.
Language Testing 14: 3, 328–339.
Elder, C. 1996. What does test bias have to do with fairness? Language
Testing 14: 261–277.
Educational Testing Service 1997. Program Research Review. Princeton, NJ:
Author.
Frankena, W. 1973. Ethics, 2nd ed. Saddle River, NJ: Prentice-Hall.
Genesee, F. and J. Upshur 1996. Classroom-based Evaluation in Second
Language Education. Cambridge, UK: Cambridge University Press.
Ginther, A. and J. Stevens 1998. Language background, ethnicity, and the
internal construct validity of the Advanced Placement Spanish language
examination. In A. J. Kunnan (ed.), Validation in Language Assessment
(pp. 169194). Mahwah, NJ: Lawrence Erlbaum Associates.
Groot, P. 1990. Language testing in research and education: The need for
standards. AILA Review 7: 9–23.
42
2 Test fairness
Hale, G. 1988. Student major field and text content: interactive effects on
reading comprehension in the TOEFL. Language Testing 5: 49–61.
Hamp-Lyons, L. 1997a. Washback, impact and validity: ethical concerns.
Language Testing 14: 3, 295–303.
Hamp-Lyons, L. 1997b. Ethics in language testing. In C. Clapham and D.
Corson (eds.), Encyclopedia of Language and Education. (Volume 7,
Language Testing and Assessment) (pp. 323–333). Dordrecht, The
Netherlands: Kluwer Academic Publishers.
Harris, D. 1969. Testing English as a Second Language. New York, NY:
McGraw-Hill.
Henning, G. 1987. A Guide to Language Testing. Cambridge, MA: Newbury
House.
Holland, P. and H. Wainer (eds.). 1993. Differential Item Functioning.
Hillsdale, NJ: Lawrence Erlbaum Associates.
Hughes, A. 1989. Testing for Language Teachers. Cambridge, UK:
Cambridge University Press.
Impara, J. and B. Plake (eds.). 1998. 13th Mental Measurements Yearbook.
Lincoln, NE: The Buros Institute of Mental Measurements, University of
Nebraska-Lincoln.
International English Language Testing System. 1999. Research Reports.
Cambridge, UK: UCLES.
Kane, M. 1992. An argument-based approach to validity. Psychological
Bulletin 112: 527–535.
Keyser, D. and R. Sweetland (eds.). 1994. Test Critiques 10. Austin, TX:
Pro-ed.
Kim, J.-O. and C. Mueller. 1978. Introduction to Factor Analysis. Newbury
Park, CA: Sage.
Kunnan, A. J. 1990. DIF in native language and gender groups in an ESL
placement test. TESOL Quarterly 24: 741–746.
Kunnan, A. J. 1995. Test Taker Characteristics and Test Performance: A
Structural Modelling Approach. Cambridge, UK: Cambridge University
Press.
Kunnan, A. J. 2000. Fairness and justice for all. In A. J. Kunnan (ed.),
Fairness and Validation in Language Assessment (pp. 1–14). Cambridge,
UK: Cambridge University Press.
Lado, R. 1961. Language Testing. London, UK: Longman.
Lynch, B. 1997. In search of the ethical test. Language Testing 14: 3,
315–327.
Messick, S. 1989. Validity. In R. Linn (ed.), Educational Measurement (pp.
13–103). London: Macmillan.
Norton, B. 1997. Accountability in language testing. In C. Clapham and D.
Corson (eds.), Encyclopedia of Language and Education. (Volume 7,
Language Testing and Assessment) (pp. 313–322). Dordrecht, The
Netherlands: Kluwer Academic Publishers.
43
2 Test fairness
Norton, B. and P. Stein. 1998. Why the 'monkeys passage' bombed: tests,
genres, and teaching. In A. J. Kunnan (ed.), Validation in Language
Assessment (pp. 231–249). Mahwah, NJ: Lawrence Erlbaum Associates.
Oltman, P., J. Stricker and T. Barrows. 1988. Native language, English
proficiency and the structure of the TOEFL. TOEFL Research Report 27.
Princeton, NJ: Educational Testing Service.
Pojman, L. 1999. Ethics, 3rd ed. Belmont, CA: Wadsworth Publishing Co.
Rawls, J. 1971. A Theory of Justice. Cambridge, MA: Belknap Press of
Harvard University Press.
Ross, W. 1930. The Right and the Good. Oxford, UK: Oxford University
Press.
Ryan, K. and L. F. Bachman. 1992. Differential item functioning on two tests
of EFL proficiency. Language Testing 9: 1, 12–29.
Sen, A. and B. Williams. 1982. (eds.). Utilitarianism and Beyond. Cambridge,
UK: Cambridge University Press.
Shohamy, E. 1997. Testing methods, testing consequences: are they ethical?
Are they fair? Language Testing 14: 340–349.
Smart, J. and B. Williams. 1973. Utilitarianism; For and Against. Cambridge,
UK: Cambridge University Press.
Spolsky, B. 1981. Some ethical questions about language testing. In C. Klein-
Braley and D. K. Stevenson (eds.), Practice and Problems in Language
Testing 1 (pp. 5–21). Frankfurt, Germany: Verlag Peter Lang.
Spolsky, B. 1995. Measured Words. Oxford, UK: Oxford University Press.
Spolsky, B. 1997. The ethics of gatekeeping tests: what have we learned in a
hundred years? Language Testing 14: 3, 242–247.
Stansfield, C. 1993. Ethics, standards, and professionalism in language
testing. Issues in Applied Linguistics 4: 2, 189–206.
Stevenson, D. K. 1981. Language testing and academic accountability: on
redefining the role of language testing in language teaching. International
Review of Applied Linguistics 19: 15–30.
Stigler, S. 1986. The History of Statistics. Cambridge, MA: Belknap Press of
Harvard University Press.
Taylor, C., J. Jamieson, D. Eignor and I. Kirsch. 1998. The relationship
between computer familiarity and performance on computer-based
TOEFL test tasks. TOEFL Research Report No. 61. Princeton, NJ:
Educational Testing Service.
University of Michigan English Language Institute 1996. MELAB Technical
Manual. Ann Arbor, MI: University of Michigan Press. Author.
Valdés, G. and R. Figueroa. 1994. Bilingualism and Testing: A Special Case
of Bias. Norwood, NJ: Lawrence Erlbaum Associates.
Wall, D. and J. C. Alderson. 1993. Examining washback: the Sri Lankan
impact study. Language Testing 10: 41–70.
Appendix 1
Test fairness framework
Table 1: Test fairness framework

1. Validity
Content representativeness/coverage – representativeness of items, tasks, topics
Construct or theory-based validity – representation of construct/underlying trait
Criterion-related validity – test-score comparison with external criteria
Reliability – stability, alternate-form, inter-rater and internal consistency

2. Absence of bias
Offensive content or language – stereotypes of population groups
Unfair penalisation – content bias based on test takers' background
Disparate impact and standard setting – DIF in terms of test performance; criterion setting and selection decisions

3. Access
Educational – opportunity to learn
Financial – comparable affordability
Geographical – optimum location and distance
Personal – accommodations for test takers with disabilities
Equipment and conditions – appropriate familiarity

4. Administration
Physical setting – optimum physical settings
Uniformity and security – uniformity and security

5. Social consequences
Washback – desirable effects on instruction
Remedies – re-scoring, re-evaluation; legal remedies
Appendix 2
Notes
1 Angoff (1988) notes that this shift is a significant change.
2 See this document for a full listing of titles and abstracts of research studies from
1960 to 1996 for TOEFL as well as other tests such as SAT, GRE, LSAT and
GMAT.
3 FCE stands for First Certificate in English, CPE for Certificate of Proficiency
in English, and IELTS for the International English Language Testing System.
4 Another organisation, the Association of Language Testers in Europe (ALTE), of
which UCLES is a member, has a Code of Practice that closely resembles the Code
of Fair Testing Practices in Education (1988). However, there are no published test
evaluation reports that systematically apply the Code.
5 MELAB stands for the Michigan English Language Assessment Battery.
6 Recent reviews in Language Testing of the MELAB, the TSE, the APIEL and the
TOEFL CBT have begun to discuss fairness (in a limited way) along with
traditional qualities such as validity and reliability.
7 This uniformity is probably also due to the way in which MMY editors prefer to
conceptualise and organise reviews under headings, such as description, features,
development, administration, validity, reliability and summary.
8 For DIF methodology, see Holland and Wainer (1993) and Camilli and Shepard
(1994).
9 For arguments for and against utilitarianism, see Smart and Williams (1973) and
Sen and Williams (1982).
10 Bentham, the classical utilitarian, invented a scheme to measure pleasure and pain
called the Hedonic calculus, which registered seven aspects of a pleasurable or
painful experience: intensity, duration, certainty, nearness, fruitfulness, purity and
extent. According to this scheme, summing up the amounts of pleasure and pain for
sets of acts and then comparing the scores could provide information as to which
acts were desirable.
11 See Ross (1930) and Rawls (1971) for discussions of this system.
12 See Crisp and Slote (1997) and Baron, Pettit and Slote (1997) for discussions of
virtue-based ethics. Non-secular ethics such as religion-based ethics, non-Western
ethics such as African ethics, and feminist ethics are other ethical systems that may
be appropriate to consider in different contexts.
13 See Rawls (1971) A Theory of Justice for a clear exposition of why it is necessary
to have an effective sense of justice in a well-ordered society.
14 These principles are articulated in such a way that they complement each other and
if there is a situation where the two principles are in conflict, Principle 1 (The
Principle of Justice) will have overriding authority. Further, the sub-principles are
only explications of the principles and do not have any authority on their own.
15 The authors of the document also acknowledge that many additional interpretations
of the term fairness may be found in the technical testing and the popular
literature.
16 There is substantial literature that is relevant to bias and DIF in language testing.
For empirical studies, see Alderman and Holland (1981), Chen and Henning
(1985), Zeidner (1986, 1987), Oltman et al. (1988), Kunnan (1990), Ryan and
Bachman (1992).
17 For standard setting, the concept and practice, see numerous papers in Cizek
(2001).
18 In the US, Title VII of the Civil Rights Act of 1964 provides remedies for persons
who feel they are discriminated against owing to their gender, race/ethnicity, native
language, national origin, and so on. The Family Educational Rights and
Privacy Act of 1974 provides for the right to inspect records such as tests and the
right to privacy, limiting access to official school records to those who have legitimate
educational needs. The Individuals with Disabilities Education Amendments Act
of 1991 and the Rehabilitation Act of 1973 provide for the right of parental
involvement and the right to fairness in testing. Finally, the Americans with
Disabilities Act of 1990 provides for the right to accommodated testing. These Acts
have been used broadly to challenge tests and testing practices in court.
Section 2
Research Studies
3
Qualitative research methods in language test development and validation
Anne Lazaraton
University of Minnesota, Minneapolis, MN USA
Introduction
In a comprehensive, state-of-the-art article that appeared in Language
Testing, Lyle Bachman (2000) overviews many of the ways in which language
testing has matured over the last twenty years: practical advances have taken
place in computer-based assessment, we have a greater understanding of the
many factors (e.g. characteristics of both test takers and the test-taking
process) that affect performance on language tests, there is a greater emphasis
on performance assessment, and there is a new concern for ethical issues in
language testing. Furthermore, Bachman points to the
increasing sophistication and diversity of quantitative methodological
approaches in language-testing research, including criterion-referenced
measurement, generalisability theory, and structural equation modelling.
In my opinion, however, the most important methodological development
in language-testing research over the last decade or so has been the
introduction of qualitative research methodologies to design, describe and,
most importantly, to validate language tests. Bachman (2000), Banerjee and
Luoma (1997), and Taylor and Saville (2001), among others, note the ways in
which such qualitative research methodologies can shed light on the complex
relationships that exist among test performance, test-taker characteristics and
strategies, features of the testing process, and features of testing tasks, to name
a few. That is, language testers have generally come to recognise the
limitations of traditional statistical methods for validating language tests and
have begun to consider more innovative approaches to performance test
validation, approaches which promise to illuminate the assessment process
itself, rather than just assessment outcomes.
In this paper, I explore the role of qualitative research in language test
development and validation: by briefly overviewing the nature of qualitative
research and examining how such research can support our work as language
testers; by discussing in more detail one such approach, the discourse-analytic
approach of conversation analysis; by describing how this qualitative research
method has been employed by language testers in the development and
validation of ESL/EFL examinations, especially oral language tests; and by
presenting some recently completed work in this area and the practical
outcomes derived from it. Finally, I note some limitations of this type of
qualitative research that should be kept in mind and point to other ways in
which language-testing research using qualitative methods might be
conducted.
Figure 1
naturalistic – controlled
observational – experimental
subjective – objective
descriptive – inferential
process-orientated – outcome-orientated
valid – reliable
holistic – particularistic
real, rich, deep data – hard, replicable data
ungeneralisable single case analysis – generalisable aggregate analysis
Grotjahn (1986) warned that a reliance on statistical analyses alone could not give us a
full understanding of what a test measures, that is, its construct validity; he
proposed employing more introspective techniques for understanding
language tests. Fulcher (1996) observes that test designers are employing
qualitative approaches more often, a positive development given that 'many
testing instruments do not contain a rigorous applied linguistics base, whether
the underpinning be theoretical or empirical. The results of validation studies
are, therefore, often trivial' (p. 228). While an in-depth discussion of test
validity and construct validation is beyond the scope of this paper, it should be
noted that there is a growing awareness that utilising approaches from other
research traditions to validate language tests is called for. For example,
Kunnan (1998a) maintains that although 'validation of language (second and
foreign language) assessment instruments is considered a necessary technical
component of test design, development, maintenance, and research as well as
a moral imperative for all stakeholders who include test developers, test-score
users, test stockholders, and test takers, only recently have language
assessment researchers started using a wide variety of validation approaches
and analytical and interpretative techniques' (p. ix). More to the point,
McNamara (1996: 8586) argues that current approaches to test validation put
too much emphasis on the individual candidate. Because performance
assessment is by nature interactional, we need to pay more attention to the co-
constructed nature of assessment. In fact, 'the study of language and
interaction continues to flourish ... although it is too rarely cited by researchers
in language testing, and almost not at all by those proposing general theories
of performance in second language tests; this situation must change'.
Hamp-Lyons and Lynch (1998) see some reason for optimism on this point,
based on their analysis of the perspectives on validity present in Language
Testing Research Colloquium (LTRC) abstracts. Although they conclude that
the LTRC conference is still positivist-psychometric dominated, it 'has been
able to allow, if not yet quite welcome, both new psychometric methods and
alternative assessment methods, which has led to new ways of constructing
and arguing about validity' (p. 272).
Undoubtedly, there is still much to be said on this issue. Whatever approach
to test validation is taken, we must not lose sight of what is important in any
assessment situation: 'that decisions made on the basis of test scores are fair,
because the inferences from scores are reliable and valid' (Fulcher 1999:
234). That is, 'the emphasis should always be upon the interpretability of test
scores' (p. 226). In the empirical studies described in this paper, the overriding
aim was to ensure confidence in just such interpretations based on the scores
that the tests generated.
Specifically, it seems clear that more attention to and incorporation of
discourse analysis in language test validation is needed. Fulcher (1987)
remarks that a new approach to construct validation in which the construct
used to support the test development and validation process for several of the
Cambridge EFL Speaking Tests.
Very briefly, Cambridge EFL examinations are taken by more than 600,000
people in around 150 countries yearly to improve their employment prospects,
to seek further education, to prepare themselves to travel or live abroad, or
because they want 'an internationally recognised certificate showing the level
they have attained in the language' (UCLES 1999a). The exams include a
performance testing component, in the sense that they assess candidates'
ability to communicate effectively in English by producing both a written and
an oral sample; these components are integral parts of the examinations. The
examinations are linked to an international system of levels for assessing
European languages established by the Association of Language Testers in
Europe (ALTE), consisting of five user levels.
The Main Suite Cambridge EFL Examinations test General English and
include the Certificate of Proficiency in English (CPE, Level 5), the
Certificate in Advanced English (CAE, Level 4), the First Certificate in
English (FCE, Level 3), the Preliminary English Test (PET, Level 2), and the
Key English Test (KET, Level 1). English for Academic Purposes is assessed
by the International English Language Testing System (IELTS), jointly
administered by UCLES, The British Council, and IDP Education Australia.
IELTS provides proof of the language ability needed to study in English at
degree level. Results are reported in nine bands, from Band 1 (Non-User) to
Band 9 (Expert User). Band 6 is approximately equivalent to a good pass at
Level 3 of the five-level ALTE scale. The Cambridge Assessment of Spoken
English (CASE) was the prototype examination designed between 1990 and
1992, which has been influential in the development of the Cambridge
Speaking Tests, such as FCE and IELTS.
At the operational level of the speaking tests, attention to a number of test
facets is evident. The Speaking Tests require that an appropriate sample of
spoken English be elicited, and that the sample be rated in terms of
predetermined descriptions of performance. Therefore, valid and reliable
materials and criterion-rating scales, as well as a professional Oral Examiner
cadre, are fundamental components in this enterprise. These concerns are
addressed in the following ways:
a paired format is employed;
examiner roles are well defined;
test phases are predetermined;
a standardised format is used;
assessment criteria are based on a theoretical model of language ability and
a common scale for speaking;
oral examiners are trained, standardised, and monitored.
Rephrasing questions
(2) CASE Candidate 8 (3:2-18) Examiner 2
---> IN: and do you think you will stay in the same (.) um (.)
---> area? in the same company?
(.5)
---> IN: [in the future? (.) in your job.
5 CA: [same?
CA: m-
(.)
CA: [same company?,
---> IN: [in the f- (.) in the same company in the future.
10 (1.0)
CA: uh: .hhh uh- (.) my company's eh (.) country?
---> IN: .hhh no .hhh in (.) in your future career?,=[will you stay
CA: =[hmm
---> IN: with (.) in the same area? in pharmaceuticals?
---> or do you think you may change
CA: oh: ah .hhh I want to stay (.) in the same (.) area.
The interlocutor evaluates the candidate's job by saying 'sounds
interesting' in (3), a type of response not suggested by the CASE Interlocutor
Frame:
Evaluating responses
(3) CASE Candidate 37 (1:23-34) Examiner 2
IN: what's your job
CA: I'm working in an advertising agency ... our company manages psk
all advertising plan? for our clients. .hhh and I'm a (.) media planner for
radio there.
5 IN: media planner for [radio.
CA: [yah.
---> IN: sounds interesting?
CA: mmhmm.
In (4), the interlocutor does an embedded correction of the preposition 'in'
by replacing it with the correct preposition 'to' in line 3:
Repeating and/or correcting responses
(4) CASE Candidate 36 (2:7-10) Examiner 2
IN: which country would you like to go to.
CA: I: want to go: (.) in Spain.
---> IN: to Spain. [ah. (.) why? Spain.
CA: [yah
In (5), the interlocutor states, rather than asks, a question prescribed by the
Candidate behaviour
A second set of Cambridge-commissioned work dealt with candidate language
produced on FCE and IELTS. However, because the focus of this research was
solely on the features of candidate language, conversation analysis, which
analyses dyadic interaction, could not be used. In these cases, a broader
approach of discourse analysis was considered a viable option for
understanding candidate speech production within the context of an oral
examination.
1. Research on FCE
The study on FCE was part of a larger FCE Speaking Test Revision Project,
which took place during the five years preceding the implementation of the
new version in 1996. Generally, according to Taylor (1999), revisions of
Cambridge EFL tests aim to take account of the current target use of the
candidates/learners, as well as developments in applied linguistics and the
description of language, in models of language-learning abilities and in
pedagogy, and in test design and measurement. The process typically begins
with research, involving specially commissioned investigations, market
surveys and routine test analyses, which look at performance in the five skill
areas, task types, corpus use, and candidate demographics. With FCE, a
survey of 25,000 students, 5,000 teachers, 1,200 oral examiners and 120
institutions asked respondents about their perspectives on the proposed
revisions. This work is followed by an iterative cycle of test draft, trialling,
and revision.
As a result, UCLES had a specific research question in mind for the FCE
study, which was used to guide the analyses: What is the relationship between
the task features in the four parts of the revised FCE Speaking Test and the
candidate output in terms of speech production? The rationale for this question
was to establish that the features of speech that are purported to be evaluated
by the rating criteria are in fact produced by the candidates. The study
consisted of two parts: first, data from the 1996 FCE Standardisation Video
was analysed in order to provide supplementary information to the
standardisation video materials, where appropriate, and to provide a
framework for examining candidate output in a dataset of live FCE Speaking
Tests that formed the basis of the second study. In the second study, Lazaraton
and Frantz (1997) analysed a corpus of live data from November–December
1996 FCE test administrations. The rationale for this second project was again
to establish that the features of speech which are predicted as output and which
are to be evaluated by the rating criteria are actually produced by the
candidates, and then to make recommendations, if necessary, about how the
descriptions of the test may be amended to make the descriptions fit the likely
speech output from the candidates. These two projects analysed the speech
produced by 61 candidates in various administrations of FCE.
In both studies, the transcripts were analysed for the speech functions
employed by the candidates in each part of the examination. This was
accomplished by dividing each transcript into four parts and labelling
candidate speech functions present in each section. The hypothesised speech
functions that are described in the FCE materials (UCLES 1996: 26–27) were
used as a starting point and were modified or supplemented as the data
analysis progressed.
A total of 15 speech functions were identified in the transcripts, a number
of which were ones that UCLES had identified as predicted candidate output
functions. Some, however, were components of the expected functions which
were thought to be too broad, too vague, or which showed too much overlap
with another function. That is, then, some of the expected functions were
divided into more specific functions. Finally, a few of the 15 identified
functions were ones which were not predicted in the FCE materials, either
directly or as part of a broader function. The analysis indicated that candidates,
for the most part, did employ the speech functions that are hypothesised in the
printed FCE materials. Part 2, where candidates are required to produce a one-
minute-long turn based on pictures, showed the most deviation from the
expected output. In this section, UCLES hypothesised that candidates would
engage in giving information and expressing opinions through comparing and
contrasting. While these speech functions did occur in the data analysed,
candidates also engaged in describing, expressing an opinion, expressing a
preference, justifying (an opinion, preference, choice, life decision) and
speculating. In fragment (7) below, the candidate spends most of her time
speculating about the feelings of the people in each picture, as she is directed,
but does not compare or contrast. Here is how Lazaraton and Frantz analysed
this response:
(7) FCE Candidate 43 (2:140-153) Examiner 377, Part 2
(Task: Couples: Id like you to compare and contrast these
pictures saying how you think the people are feeling)
1 yeah (.2) from the first picture I can see .hhh these two (.)
(description)
2 people they: seems not can:: cannot enjoy their .hhh their meal
(speculation)
3 (.) because these girl's face I think she's: um (think) I think
(justification)
4 she's: .hhh (.2) an- annoyed or something it's not impatient
5 and this boy: (.) she's also (.2) looks boring (.2) yeah I I
(speculation)
6 think they cannot enjoy the: this atmosphere maybe the: .hhh
(justification)
7 the:: waiter is not servings them (.) so they feel so (.) bored
(speculation)
8 or (.5) or maybe they have a argue or something like that (1.0)
9 yeah and from the second picture (.8) mmm::: this: rooms mmm:
(description)
10 looks very warm (.) and uh .hhh (.2) mmm these two people? (.)
11 they also canno- I think they are not talking to each other .hhh
(speculation)
12 they just (.) sit down over there and uh (.5) these gentleman
(description)
13 just smoking (.) yeah and this woman just look at her finger
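The coding itself lends itself to simple tallying. The sketch below (Python,
added here purely for illustration; it is not how Lazaraton and Frantz report
computing their counts) reproduces the tally of the function labels marked in
fragment (7), in their order of occurrence.

from collections import Counter

# Speech-function labels as marked in fragment (7) above.
labels = ["description", "speculation", "justification", "speculation",
          "justification", "speculation", "description", "speculation",
          "description"]
print(Counter(labels))
# Counter({'speculation': 4, 'description': 3, 'justification': 2})

Applied across all 61 candidates and the four test parts, such counts are what
allow the expected and observed speech functions to be compared.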
In short, it was hoped that the results of these two studies would be useful
to FCE test developers and trainers in making more accurate assessments of
candidate output in the examination. Although there is no explicit task-
achievement rating scheme for FCE, we suggested that the list of 15 speech
functions generated from these data might prove helpful in developing one.
Additionally, the list might be useful for analysing candidate output in other
Discussion
To summarise, I believe that discourse analysis offers us a method with which
we can analyse the interaction that takes place in face-to-face oral assessment,
which, until very recently, was overlooked in the test validation process.
These various studies on different Cambridge EFL Speaking Tests have
informed us about how interlocutors and candidates behave during the test,
and how this behaviour approximates to conversation. The IELTS study
attempted to compare features of the discourse produced with assigned band
scores, and along with the FCE studies, made some headway on informing
rating scale construction and validation. The unique contribution of all these
studies to the field of language testing, though, is their demonstration of both
the applicability and the suitability of conversation/discourse analysis for
understanding the process of oral assessment via an examination of the
discourse produced in this context. As Lazaraton (2001b: 174) remarks,
'Conversation analysis has much to recommend it as a means of validating
oral language tests ... Perhaps the most important contribution that CA
can make is in the accessibility of its data and the claims based on them.
That is, for many of us highly sophisticated statistical analyses are
comprehensible only to those versed in those analytic procedures ... The
results of CA are patently observable, even if one does not agree with the
conclusions at which an analyst may arrive.' As such, language testers who
engage in conversation analyses of test data have the potential to reach a much
larger, and less exclusive readership.
However, it is as well to keep in mind a number of shortcomings of the
conversation analytic approach. Aside from a number of theoretical and
conceptual objections reviewed in Lazaraton (2001b), CA is not helpful for
analysing monologic data, where there is an absence of interaction; for the
same reason, CA cannot be applied to the modality of writing. Equally
troubling is the fact that CA is difficult, if not impossible, to learn without the
benefit of tutelage under a trained analyst and/or with others. As a result, the
number of researchers who feel comfortable with and driven to use
the methodology will undoubtedly remain small. Related to this is the
labour-intensive nature of CA, which makes it impractical for looking at large
data-sets.
Another problematic aspect of qualitative research in general is that there is
no clear consensus on how such research should be evaluated. Leaving aside
issues of good language testing practice, discussed recently by Taylor and
Saville at the 2001 AAAL conference, in the larger field of qualitative
research in the social sciences and humanities there is very little agreement on
the need for permanent or universal criteria for judging qualitative research.
Garratt and Hodkinson (1998: 515–516) consider the two basic questions of
criteriology, but come to no conclusions: 1) Should we be striving to
Conclusion
Language testing is clearly in the midst of exciting changes in perspective. It
has become increasingly evident that the established psychometric methods
for validating oral language tests are effective but limited, and other validation
methods are required for us to have a fuller understanding of the language tests
we use. I have argued that conversation analysis represents one such solution
for these validation tasks. McNamara (1997: 460) sees much the same need,
as he states rather eloquently: 'Research in language testing cannot consist
only of a further burnishing of the already shiny chrome-plated quantitative
armour of the language tester with his (too often) sophisticated statistical tools
and impressive n-size; what is needed is the inclusion of another kind of
research on language testing of a more fundamental kind, whose aim is to
make us fully aware of the nature and significance of assessment as a social
act.' As the field of language testing further matures, I am optimistic that we
can welcome those whose interests and expertise lie outside the conventional
psychometric tradition: qualitative researchers like myself, of course, but also
those who take what Kunnan (1998b) refers to as 'postmodern' and 'radical'
approaches to language assessment research. Furthermore, I would also hope,
along with Hamp-Lyons and Lynch, that the stakeholders in assessment, those
that use the tests that we validate, would have a greater voice in the assessment
process in order to ensure that our use of test scores is, first and foremost, a
responsible use.
Notes
This chapter is a slightly revised version of a plenary paper given at the ALTE
conference in Barcelona, July 2001. Portions of this chapter also appear in
Lazaraton (2001b).
References
Atkinson, J. M. and J. Heritage (eds.). 1984. Structures of Social Action:
Studies in Conversation Analysis. Cambridge: Cambridge University Press.
Bachman, L. F. 2000. Modern language testing at the turn of the century:
Assuring that what we count counts. Language Testing 17: 1–42.
Banerjee, J. and S. Luoma. 1997. Qualitative approaches to test validation. In
C. Clapham and D. Corson (eds.), Encyclopedia of Language and
Education, Volume 7: Language Testing and Assessment (pp. 275–287).
Amsterdam: Kluwer.
Cohen, A. 1984. On taking language tests: What the students report. Language
Testing 1: 70–81.
Davis, K. A. 1995. Qualitative theory and methods in applied linguistics
research. TESOL Quarterly 29: 427–453.
Douglas, D. 1994. Quantity and quality in speaking test performance.
Language Testing 11: 125–144.
Fulcher, G. 1987. Tests of oral performance: The need for data-based criteria.
ELT Journal 41: 4, 287–291.
Fulcher, G. 1996. Does thick description lead to smart tests? A data-based
approach to rating scale construction. Language Testing 13: 2, 208–238.
Fulcher, G. 1999. Assessment in English for Academic Purposes: Putting
content validity in its place. Applied Linguistics 20: 2, 221–236.
Garratt, D. and P. Hodkinson. 1998. Can there be criteria for selecting research
criteria? A hermeneutical analysis of an inescapable dilemma. Qualitative
Inquiry 4: 515–539.
Green, A. 1998. Verbal Protocol Analysis in Language Testing Research: A
Handbook. Cambridge: Cambridge University Press and University of
Cambridge Local Examinations Syndicate.
Grotjahn, R. 1986. Test validation and cognitive psychology: Some
methodological considerations. Language Testing 3: 159–185.
Hamp-Lyons, L. and B. K. Lynch. 1998. Perspectives on validity: An
historical analysis of language testing conference abstracts. In A. Kunnan
(ed.), Validation in Language Assessment: Selected Papers from the 17th
Language Testing Research Colloquium, Long Beach (pp. 253–276).
Mahwah, NJ: Lawrence Erlbaum.
Hill, K. 1998. The effect of test-taker characteristics on reactions to and
performance on an oral English proficiency test. In A. Kunnan (ed.),
Validation in Language Assessment: Selected Papers from the 17th
Language Testing Research Colloquium, Long Beach (pp. 209–229).
Mahwah, NJ: Lawrence Erlbaum Associates.
Kunnan, A. 1998a. Preface. In A. Kunnan (ed.), Validation in Language
Assessment: Selected Papers from the 17th Language Testing Research
Colloquium, Long Beach (pp. ix–x). Mahwah, NJ: Lawrence Erlbaum
Associates.
Appendix 1
Transcription Notation Symbols (from Atkinson and
Heritage 1984)
1. unfilled pauses or gaps – periods of silence, timed in tenths of a second
by counting beats of elapsed time. Micropauses, those of less than .2
seconds, are symbolised (.); longer pauses appear as a time within
parentheses: (.5) is five tenths of a second.
2. colon (:) – a lengthened sound or syllable; more colons prolong the
stretch.
3. dash (-) – a cut-off, usually a glottal stop.
4. .hhh – an inbreath; .hhh! – strong inhalation.
5. hhh – exhalation; hhh! – strong exhalation.
6. hah, huh, heh, hnh – all represent laughter, depending on the sounds
produced. All can be followed by an (!), signifying stronger laughter.
7. (hhh) – breathiness within a word.
8. punctuation – markers of intonation rather than clausal structure: a full
point (.) is falling intonation, a question mark (?) is rising intonation, a
comma (,) is continuing intonation. A question mark followed by a
comma (?,) represents rising intonation, but is weaker than a (?). An
exclamation mark (!) is animated intonation.
9. equal sign (=) – a latched utterance, no interval between utterances.
10. brackets ([ ]) – overlapping talk, where utterances start and/or end
simultaneously.
11. per cent signs (% %) – quiet talk.
12. asterisks (* *) – creaky voice.
13. carat (^) – a marked rising shift in pitch.
14. arrows (> <) – the talk speeds up; arrows (< >) – the talk slows down.
15. psk – a lip smack; tch – a tongue click.
16. underlining or CAPS – a word or SOund is emphasised.
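Because the notation is systematic, transcript features such as timed pauses
can also be extracted mechanically. The sketch below (Python, added for
illustration; representing a micropause as a nominal 0.1 seconds is my
assumption, since the notation only bounds it below 0.2 seconds) pulls pause
lengths out of a transcribed line.

import re

def pause_lengths(line):
    """Return pause durations, in seconds, from a CA-transcribed line:
    (.) is a micropause (taken here as a nominal 0.1 s), (.5) is 0.5 s."""
    out = []
    for tok in re.findall(r"\((\d*\.\d*)\)", line):
        out.append(0.1 if tok == "." else float("0" + tok))
    return out

print(pause_lengths("oh: ah .hhh I want to stay (.) in the same (.5) area."))
# [0.1, 0.5]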
4
European solutions to non-European problems
Abstract
This paper examines the major issues relating to the proposed introduction of
compulsory assessment of English language proficiency for students prior to
graduation from tertiary education in Hong Kong. It looks at the on-going
debate relating to the introduction of a language exit-test and considers
possible alternatives to formal standardised tests for reporting on language
proficiency. It then describes a study that set out to discover students' and their
future employers' views on the introduction of such an exit mechanism. The
paper concludes by suggesting how a valid and reliable reporting mechanism
could be developed for students in Hong Kong by drawing on the current
work being done by the Council of Europe's European Portfolio Project.
Introduction
Since 1997, when sovereignty over Hong Kong changed from British rule to
that of the People's Republic of China, Hong Kong has to some extent
struggled to establish its own identity. On the one hand is the desire to be
divested of all trappings of the colonial past; on the other is the knowledge that
Hong Kong's viability in the commercial world is largely dependent on its
positioning as a knowledge-based, international business community. This
latter outlook has been at the root of the concern, much discussed in the local
media over the past two decades, that standards of English language
proficiency in Hong Kong are declining.
A succession of Education Committee reports and Policy Addresses by the
Governor (prior to 1997) and Chief Executive (following the return of Hong
Kong to China) have led to the introduction of a number of schemes designed
to strengthen language proficiency throughout the educational system. In the
post-compulsory education sector these include, inter alia, generous grants to
the eight tertiary institutions for language enhancement provision [i]; the
development of a new benchmark test, the Language Proficiency Assessment
for Teachers (LPAT) [ii].
Background
Despite the willingness of the government to tackle what is perceived to be a
growing problem through the introduction of the measures mentioned above,
business leaders still regularly express concern that students graduating from
Hong Kong's tertiary institutions do not possess the requisite language skills
to perform adequately in the modern workplace. The major local English
language newspaper, the South China Morning Post, regularly publishes
letters to the Editor and articles written by academics and businessmen
lamenting the fact (often with little more than anecdotal evidence) that
students' English proficiency is not adequate to meet job demands. However,
the general consensus amongst business leaders appears to be that it is not
within the remit of business to provide training in general language skills for
future employees (Lee and Lam 1994) and that there is a need for school
leavers and graduates to have a good command of English to enter the
business, professional and service sectors (Education Commission 1995: 4).
In order to help employers 'choose employees whose English is at the right
standard for the company's needs' (Hamp-Lyons 1999: 139), funding was
awarded to the Hong Kong Polytechnic University in 1994 for the
development and trialling of test batteries in English and Chinese. The first
full official administration of the resulting Graduating Students' Language
Proficiency Assessment in English (GSLPA-English) took place in the
1999–2000 academic year. The content of the GSLPA-English is explicitly
geared towards the types of professional communication that it is believed
new graduates will face in their careers and is not restricted to the content of
any single course at any one institution. There is no pass/fail element to the
test as such; candidates receive a certificate that simply provides a description
of their performance for both written and spoken English [v].
Tertiary institutions in Hong Kong have, however, strongly resisted the
introduction of this test for a number of reasons, not the least of which is the
fact that currently there are no degrees awarded in Hong Kong where common
performance mechanisms are required across institutions. Another major
concern of language educators within the institutions is the impact that the
imposition of an exit-test would have on the existing curriculum. Although
Alderson and Wall (1993) and Wall (1996; 1997) argue that there is little
evidence either in general education or language education to support the
notion that tests actually influence teaching, it is generally believed, in Hong
Kong as elsewhere, that high-stakes testing programs strongly influence
curriculum and instruction to the extent that the content of the curriculum is
narrowed to reflect the content of a test. The relationship between teaching,
testing and learning is therefore considered to be one of curricular alignment
(Madaus 1988; Smith 1991; Shepard 1993). Madaus (1988) argues that the
higher the stakes of a test, the greater will be the impact, or washback, on the
curriculum, in that past exam papers will become the de facto curriculum and
teachers will adjust their teaching to match the content of the exam questions
(see also Shohamy 1993; 1998 and 2001). If, as has been suggested, students
were in future required to achieve a certain degree of language competence as
a condition of graduation, or for gaining entry into high-status professions,
then the stakes would be very high indeed.
Messick (1996: 241) defines washback as 'the extent to which the use of a
test influences language teachers and learners to do things they would not
otherwise do that promote or inhibit learning'. In the past few years revisions
to many of Hong Kong's public examinations have been made in a deliberate
attempt to provoke changes in teaching and learning although, as Hamp-Lyons
(1999) notes, while there may have been discernible changes in the content of
what is taught, there is little evidence that assessment change or innovation has
led to changes in actual teaching practice. There remains, nevertheless, a
concern among Hong Kong's tertiary educators that achieving a high score on
the test would become the major focus of university study, to the detriment of
other important skills. Whatever the resistance of educators and institutions, it
is clear that in the near future some mechanism for reporting on students'
language ability on graduation will be adopted in Hong Kong.
1 Berry and Lewkowicz (2000) report the results of a pilot survey carried out at the
University of Hong Kong in April 2000 to solicit views of undergraduate students
about the need for and desirability of introducing compulsory language assessment
prior to graduation. 1,418 students took part in the study, which was designed to elicit
students' views on the following issues: 1) whether at graduation students should be
required to take a language test to demonstrate their proficiency in English; 2) what
authority should be responsible for administering the test if a test were required; and 3)
whether they considered a portfolio to be a fair and acceptable alternative to a language
test.
Stakeholder Surveys
Before a new system could be suggested, it was necessary to determine
whether Hong Kong was ready to accept a challenge to current beliefs and
practices. To ascertain this and to investigate the types of change that would
be acceptable, we carried out two surveys of the primary stakeholders likely
to be affected by any change in policy, namely students and employers. These
are described below.
Student Survey
The purpose of this survey was to determine students' perceptions of the most
effective way to report their English language ability on graduating from
university. To tap the opinions of as many students as possible across a range
of tertiary institutions, we decided to use a questionnaire. This questionnaire
was based on a pilot questionnaire developed at the University of Hong Kong
(HKU) in the previous year (for details see Berry and Lewkowicz 2000). The
revised questionnaire that was again piloted, this time on a small group of
HKU students, was divided into five sections, eliciting: demographic data;
alternative elements which could be included in a portfolio; the usefulness of
different forms of test that could be used; optimal ways of providing future
employers with useful and accurate information about applicants' language
proficiency; and a final section inviting additional comments (see Appendix
1).
Responses were received from 1600+ students from seven of the eight
tertiary institutions in Hong Kong. The majority were from first- or second-
year students, 69% and 26% respectively, as they were the most likely to be
affected by any change in policy. They came from a range of disciplines from
engineering and science to the arts, law and the humanities, and for each
targeted discipline there were respondents from at least two institutions.
Although the number of respondents varied across institutions from 40 to
400+, the similarity of responses suggests that there is considerable agreement
among students as to how their language ability should be reported.
Even though most of the students (71%) had been learning English for 16
or more years, and the majority (77%) had been through an English-medium
secondary education, they reported that their exposure to English outside the
formal classroom setting was limited (see Table 1). Despite this, and the
recognition of many respondents (65%) that their English had not improved at
university, they were realistic in recognising that they would be required to
demonstrate their English ability on graduation and they were co-operative in
completing the questionnaire. None took the opportunity of the additional
comments section to say that their level of English was irrelevant or that none
of the ways suggested for reporting their level was appropriate.
When asked to rank-order three possible options for reporting their English
ability to employers, namely using a portfolio which includes a recent test
score, a portfolio with no test score, or a test score on its own, the first option,
that is, portfolio plus test score, appeared to be the one most favoured (see
Table 2). This suggests that the majority of the students are pragmatic and
recognise that for any report of their language ability to be informative, it
should be comprehensive. Had they selected the most expedient solution, they
would, most probably, have opted simply for a test score.
When asked to rank the three types of test on offer, the majority selected
an international test as their first choice and the Hong Kong-wide test as their
second choice, the frequencies of response being 939 (58%) and 861 (53%)
respectively (see Table 3).
Survey of Employers
The aim of this survey was to determine whether potential employers, from
both large and small companies, would accept portfolio assessment.
Representatives from the Human Resources Departments of 12 companies
(four multi-nationals, four public-listed companies and four small-to-medium
enterprises) were invited to participate in in-depth, structured interviews at
their convenience. Each interview lasted 45–60 minutes, with the interviewers
taking notes during the interview; where possible, the interviews were also
audio-recorded. The interviews, which followed the organising framework of
the student questionnaires, centred on the companies' selection procedures for
new graduates, their language requirements, and their attitudes towards exit
tests and portfolio assessment (see Appendix 2). Here we focus on the results
relating to portfolio assessment and tests.
During the interviews it became clear that concepts such as portfolio
assessment, which are well known within educational circles, were not at all
familiar to the business community. It was therefore necessary to provide
interviewees with information about portfolio assessment and show examples
of portfolios such as those being developed for use in Europe. Despite their
initial lack of familiarity with alternative methods of assessment, once the
human resources representatives saw what this would involve, the majority (7
of 12) ranked the option of portfolio including a test score as the best.
However, they added a proviso that employers should not be responsible for
going through applicants' portfolios but that each portfolio should contain a
one-page summary verified by an assessment professional.
The company representatives went to some length to explain that their main
purpose in seeing a language assessment at graduation was not simply to help
them with their selection procedures, as they insisted that they would continue
to use their own selection procedures whether or not a language assessment
was introduced. They were, instead, adamant that there was a need for a
mechanism that would ensure the improved language abilities of those
graduating from Hong Kong universities. Complementing their reluctance to
take responsibility for examining applicants' portfolios, they also stressed that
the format of the assessment should be determined by those qualified in the
field, emphasising that they did not feel it was their responsibility to say what
should be included in a portfolio if one were to be introduced.
During the discussions it became apparent that employers were often
looking for more than language skills; they also wanted the graduates to have
improved socio-linguistic competence, which is, of course, very difficult to
assess using traditional tests. In addition, they were looking for enhanced oral
as well as written skills, preferably to the level of the many returning graduates
who had studied outside Hong Kong.
Whereas the majority of employers were in agreement with the students
Conclusions
Perhaps one of the main limitations of the student survey is that even though
some of the respondents had undoubtedly experienced compiling a language
portfolio during their time at university, there is no guarantee that all were
fully aware of what this would entail. Despite this, it appears that students
would favour providing comprehensive evidence of their language abilities
and that they are ready and able to participate in any discussions of future
assessment requirements. Employers also seem prepared to accept changes to
current assessment practices and so a way forward would be for Hong Kong
to learn from the seminal work being done in Europe, but to accept that any
system introduced would need to be modified for the particular circumstances
of Hong Kong. This would inevitably take time, especially as an acceptable
reporting mechanism would have to be developed. It would also be necessary
to raise consciousness among the different stakeholders. Our survey of Human
Resources representatives, though restricted, showed that most employers
need information about portfolio assessment if it is to be introduced. It also
showed that there might be a mismatch between employers' expectations as to
the language abilities they consider graduates need to enhance, and those
abilities that students deem it necessary to demonstrate.
These are not, however, insurmountable problems and if portfolio
assessment were embraced it would go some way towards ensuring that the
system implemented matched Hong Kongs educational objectives.
Furthermore, if such a system were developed in conjunction with a structured
and rigorous program of research, Hong Kong could end up with an
assessment mechanism that was not only valid and reliable but was also useful
and fair to all.
Acknowledgments
This research was supported by C.R.C.G. grants 10202627 and 10203331
from the University of Hong Kong. We would like to thank our research
assistant, Matthaus Li, for assistance with data input and analysis.
References
Alderson, J. C. and D. Wall. 1993. Does washback exist? Applied Linguistics
14: 2, 115–129.
Berry, V. and J. Lewkowicz. 2000. Exit tests: Is there an alternative? In V.
Berry and J. Lewkowicz (eds.) Assessment in Chinese Contexts: Special
Edition of the Hong Kong Journal of Applied Linguistics: 19–49.
Brown, J. D. and T. Hudson. 1998. The alternatives in language assessment.
TESOL Quarterly 32: 4, 653–676.
Daiker, D., J. Sommers and G. Stygall. 1996. The pedagogical implications of
a college placement portfolio. In E. White, W. Lutz and S. Kamusikiri
(eds.) Assessment of Writing (pp. 257–270). New York: The Modern
Language Association of America.
Douglas, D. 2000. Assessing Languages for Specific Purposes. Cambridge:
Cambridge University Press.
Education Commission. 1988. Education Report 3. The Structure of Tertiary
Education and the Future of Private Schools. Hong Kong: Government
Printer.
Education Commission. 1995. Education Report 6. Enhancing Language
Proficiency: A Comprehensive Strategy. Hong Kong: Government Printer.
Education Commission. 2000. Learning for Life; Learning through Life:
Reform Proposal for the Education System in Hong Kong. Hong Kong:
Government Printer.
Falvey, P. and D. Coniam. 2000. Establishing writing benchmarks for primary
and secondary teachers of English language in Hong Kong. In V. Berry and
J. Lewkowicz (eds.) Assessment in Chinese Contexts: Special Edition of the
Hong Kong Journal of Applied Linguistics: 128–159.
Genesee, F. and J. A. Upshur. 1996. Classroom-Based Evaluation in Second
Language Education. Cambridge: Cambridge University Press.
Hamp-Lyons, L. 1999. Implications of the examination culture for (English
language) education in Hong Kong. In V. Crew, V. Berry and J. Hung
(eds.) Exploring Diversity in the Language Curriculum (pp. 133–140).
Hong Kong: The Hong Kong Institute of Education.
Hamp-Lyons, L. and W. Condon. 2000. Assessing the Portfolio: Principles for
Practice, Theory, Research. Cresskill, NJ: Hampton Press.
Lee, N. and A. Lam. 1994. Professional and Continuing Education in Hong
Kong: Issues and Perspectives. Hong Kong: Hong Kong University Press.
Lucas, C. 1992. Introduction: Writing portfolios: changes and challenges. In
K. B. Yancey (ed.) Portfolios in the Writing Classroom: An Introduction
(pp. 1–11). Urbana, Illinois: NCTE.
Madaus, G. F. 1988. The influence of testing on the curriculum. In L. N.
Tanner (ed.) Critical Issues in the Curriculum: 87th Yearbook of the
National Society for the Study of Education, Part 1 (pp. 83–121). Chicago:
University of Chicago Press.
Appendix 1
Questionnaire
It is probable that students will be required to demonstrate their English language
proficiency on graduating from university. Since there are a number of options for
reporting proficiency levels, we are interested in your views as to which option you
consider would best demonstrate your proficiency in English.
This booklet contains FOUR pages including the front and back. Please answer all
questions, including those on the back page.
All responses to this questionnaire will be treated with the utmost confidentiality
and used for research purposes only.
I. Personal Data:
II. Portfolios
III. Tests
Another way of showing your language proficiency is by simply providing a test score.
Below are three alternatives, all of which are being considered. Please rank these
according to which you think would be the most useful to you and to your future
employers (1 = most useful and 3 = least useful)
University-specific test (each institution sets and marks its own tests) ___
Hong Kong-wide test (one test for all graduating students in Hong Kong) ___
International test (a test developed outside Hong Kong and
widely recognised throughout the world, e.g. IELTS; TOEFL) ___
V. Additional Comments:
Please add any further comments you have about demonstrating your English language
proficiency on graduation. You may continue your comments on the first page of this
questionnaire.
_____________________________________________________________________
_____________________________________________________________________
_____________________________________________________________________
Thank You
Appendix 2
I. Company Data
1. On average, approximately how many new graduates do you recruit each year?
_____________________________________________________________________
_____________________________________________________________________
2. What sorts of entry level positions are available?
_____________________________________________________________________
_____________________________________________________________________
3. Do you test potential recruits for language skills? Yes No
4. If you answered yes to question 3, what form does the test take?
_____________________________________________________________________
_____________________________________________________________________
5. What do you perceive as typical language problems in potential recruits?
_____________________________________________________________________
_____________________________________________________________________
6. Can you tell us what you know about the government's proposal for graduating
students' exit assessment?
_____________________________________________________________________
_____________________________________________________________________
7. Are you in favour of it? Yes No
8. Are you aware of the possibility of using a language portfolio for assessment
purposes? Yes No
9. If you were to use a portfolio to assess language skills, would you require a cover
page providing summary information? Yes No
10. Can you suggest what information would crucially be contained on a cover page,
if required?
_____________________________________________________________________
_____________________________________________________________________
II. Portfolios
Portfolio assessment is becoming increasingly popular around the world since it
represents an opportunity for students to provide extended evidence of what they can
do in one or more languages and in a range of areas. What is included in a portfolio
may be pre-specified or left to individuals to select, or a combination of both. It may
be collected over an extended period of time.
Below is a range of items which could be included in a graduating student's
language portfolio. Please consider each element and indicate whether it should be a
compulsory or optional element of the portfolio, or whether it should not be included
at all.
Alternative elements which could be included in a Portfolio
Compulsory / Optional / Not to be included
Résumé (C.V.)
Video introducing the student
Written introduction of the student
Examples of business correspondence
Examples of project work
Written commentary on a current affairs issue
Academic writing (e.g. essays, reports, etc.
marked by faculty)
Academic writing (e.g. essays, reports, etc.
produced for English enhancement classes)
Writing done under timed conditions in class
Video of a formal oral presentation
Video of other oral skills, e.g. group
discussions, role plays, etc.
Self-assessment of student's language skills
Peer assessment of student's language skills
Teacher assessment of student's
language skills
HKEA Use of English grade
Language scores/grades achieved at university
Record of any language-related work or
formal courses taken
Record of language experiences outside
Hong Kong (e.g. on holiday or travelling)
Can you think of other examples of work that you would like included in a portfolio?
Please specify and indicate whether each should be optional or compulsory:
Compulsory / Optional / Not to be included
Examples
III. Tests
Another way of showing students' language proficiency is by simply providing a test
score. Below are three alternatives, all of which are being considered. Please rank these
according to which you think would provide the most useful information to you for
recruitment purposes (1 = most useful and 3 = least useful)
University-specific test (each institution sets and marks its own tests) ___
Hong Kong-wide test (one test for all graduating students in Hong Kong) ___
International test (a test developed outside Hong Kong and widely recognised
throughout the world, e.g. IELTS; TOEFL) ___
V. Additional Comments
Please add any further comments you have about the issue of students demonstrating
their English language proficiency on graduation.
_____________________________________________________________________
_____________________________________________________________________
_____________________________________________________________________
Thank You
Endnotes
i Language Enhancement Grants were initiated in response to a recommendation in
the Education Commission's third report (ECR3), June 1988: 85. The eight
institutions are: City University of Hong Kong, Hong Kong Baptist University,
Hong Kong Institute of Education, Lingnan University, The Chinese University
of Hong Kong, The Hong Kong Polytechnic University, The Hong Kong
University of Science and Technology and The University of Hong Kong.
ii Falvey and Coniam (2000) discuss aspects of the development of the LPAT;
Shohamy (2000) offers an alternative perspective on this initiative.
iii The four tests are the Business Language Testing Service (BULATS)
see http://www.ucles.org.uk, English Language Skills Assessment (ELSA) see
http://www.lccieb.com/LCCI/Home/ELSA.asp/, Test of English for International
Co-operation (TOEIC) see http://www.toeic.com/ and Pitman's Tests of English
for Speakers of Other Languages (EOS1; EOS2) see http://www.city-and-
guilds.com.hk/pq/wpestart.htm
iv The Chief Executive of the HKSAR, in his first Policy Address on October 8
1997, announced that he hoped universities would consider introducing a
requirement for all graduating students to take proficiency tests in English and
Chinese. See http://www.info.gov.hk/ce/speech/cesp.htm, Section E, Paragraph
93.
v For a complete description of the current version of the GSLPA-English,
see http://www.engl.polyu.edu.hk/ACLAR/projects.htm#GSLPAdevt
vi This is a public examination run by the Hong Kong Examinations Authority,
which students have to pass to enter university. See http://www.hkea.edu.hk
5 Validating questionnaires to examine personal factors in L2 test performance
James E. Purpura
Teachers College, Columbia University
Introduction
It has long been established that individual learner characteristics may
contribute differentially to a student's ability to acquire a second language
(Skehan 1989, 1998). Similarly, it has been shown that certain test-taker
characteristics, apart from that of communicative language ability, may also
influence the degree to which test takers are able to perform optimally on
language tests (Bachman 1990). Witness to this is a large body of research
demonstrating that, aside from language knowledge, the personal attributes of
test takers have a significant effect on test-score variation. Most of this
research has focused on the relationship between test takers' demographic
characteristics and their performance on tests. Such studies have examined
performance with relation to age (e.g. Farhady 1982; Spurling and Illyin 1985;
Zeidner 1987), gender (e.g. Farhady 1982; Kunnan 1990; Ryan and Bachman
1990; Sunderland 1995; Zeidner 1987), cultural background (e.g. Brière 1968;
Farhady 1979; Zeidner 1986, 1987), and language background (e.g. Alderman
and Holland 1981; Brown 1999; Brown and Iwashita 1996; Chen and Henning
1985; Elder 1995; Farhady 1982; Ginther and Grant 1997; Kunnan 1990,
1995; Oltman et al. 1988; Ryan and Bachman 1990; Swinton and Powers
1980).
A second body of research has looked at the relationship between test
takers' topical knowledge and their performance on language tests. Many of
these studies have investigated performance in relation to test takers'
academic background or their prior knowledge (e.g. Alderson and Urquhart
1985; Clapham 1993, 1996; Fox et al. 1997; Ginther and Grant 1997; Jensen
and Hansen 1995; Tedick 1990).
A third set of studies has examined the socio-psychological and strategic
characteristics of test takers and their performance on tests. These studies have
examined performance in relation to cognitive styles such as field dependence
and independence (e.g. Chapelle 1988; Hansen and Stansfield 1984; Stansfield
and Hansen 1983), attitudes toward language learning (e.g. Clément and
Kruidenier 1985; Gardner 1985, 1988; Zeidner and Bensoussan 1988);
motivation and the degree to which test takers are willing to devote time and
effort to language learning (e.g. Clément and Kruidenier 1985; Dörnyei 1990;
Dörnyei and Schmidt 2001; Gardner 1985, 1988; Gardner and Lambert 1972;
Kunnan 1995), level of anxiety (e.g. Brown et al. 1996; Bensoussan and
Zeidner 1989), and the test takers' capacity to use cognitive and metacognitive
strategies effectively (Anderson et al. 1991; Purpura 1999; Vogely 1995).
These socio-psychological and strategic factors, alone or in combination with
other personal attributes, may have a significant impact on test scores,
suggesting that language knowledge may be a necessary, but in fact not a
sufficient, condition for good language test performance.
Given the potential role of these factors in second-language learning and
assessment, researchers must continue to investigate the nature of learner
characteristics and their potential effects on learning outcomes. They must
also examine how these attributes interact with each other, and how their
simultaneous effect contributes to test-score variation; otherwise, the very
constructs we wish to measure may be masked.
Prior to examining these relationships, however, valid and reliable
instruments designed to measure these attributes must be developed. One well
established method for assessing test-taker characteristics is the questionnaire.
Questionnaires allow for a high degree of control over the probes; they can be
easily designed to measure multiple constructs simultaneously; they can be
administered to large groups of examinees; they lend themselves to statistical
analysis; and they reveal systematic patterns of behaviour in large amounts of
data that might otherwise have gone unnoticed. However, questionnaires are
notoriously sensitive to small differences in wording (Allan 1995); they often
show cross-measurement of content, producing substantial redundancy and
correlated measurement error (Byrne 1998; Purpura 1998, 1999); and they
produce over- or underestimates of the data. Given these problems, it is
important that the construct validity of questionnaires be thoroughly
investigated prior to their use in research or their application to learning, and
validation efforts need to be substantively and methodologically rigorous.
Otherwise, the inferences drawn from the use of these instruments may be
unfounded and misleading.
While the development and validation of instruments purporting to measure
these personal attributes are a critical first step in examining the relationships
between personal factors and performance, most questionnaires currently
being used have not been submitted to such rigorous validation procedures. In
fact, most researchers report no more than an assessment of the
questionnaire's internal consistency reliability. However, an increasing
number of researchers (e.g. Gardner 1985b; Oxford et al. 1987; Purpura 1997,
1998) have carried these analyses a step further by examining the underlying factor
structure of their questionnaires by means of exploratory factor analysis. No
study in our field, however, has used a confirmatory approach to examining
the factor structure of items in a questionnaire. In the current study, item-level
structural equation modelling (SEM) has been used as a means of examining
the underlying psychometric characteristics of questionnaire surveys.
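To make the confirmatory approach concrete, the sketch below shows what an item-level CFA of this kind might look like in Python with the third-party semopy package. The tooling, the scale sizes and the item names are illustrative assumptions, not the study's actual analysis code.

```python
# A minimal sketch of item-level CFA for questionnaire validation, using
# the third-party semopy package (an assumption, not the study's tooling).
# Item names (canx1..canx6, tanx1..tanx6) are hypothetical.
import pandas as pd
import semopy

# Two hypothesised anxiety factors, each measured by six items; the
# factors may correlate, but no correlated errors are postulated.
MODEL_DESC = """
LanguageClassAnxiety =~ canx1 + canx2 + canx3 + canx4 + canx5 + canx6
TestAnxiety =~ tanx1 + tanx2 + tanx3 + tanx4 + tanx5 + tanx6
LanguageClassAnxiety ~~ TestAnxiety
"""

def fit_cfa(responses: pd.DataFrame) -> pd.DataFrame:
    """Fit the measurement model and return its parameter estimates."""
    model = semopy.Model(MODEL_DESC)
    model.fit(responses)                # maximum likelihood by default
    print(semopy.calc_stats(model))     # chi-square, CFI, RMSEA, etc.
    return model.inspect(std_est=True)  # standardised factor loadings
```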
The current paper presents the preliminary findings of an on-going study
aimed at examining the construct validity of a battery of questionnaires
designed to measure selected socio-psychological and strategic background
characteristics of test takers. I will first describe these measures and the
theoretical constructs underlying their construction. I will then discuss the
process used to examine the construct validity of these instruments, using
item-level structural equation modelling, and how these validation efforts
have informed decisions about tailoring the instruments prior to their
computerisation and use in research and learning contexts.
Table 1
Taxonomy of the original language learning questionnaires (number of items per scale)
A. Attitudes Questionnaire 41
Attitudes toward English speakers 13
Attitudes toward learning English 8
Interest in foreign languages 8
Perception of task difficulty 12
B. Motivation Questionnaire 57
Integrative motivation 17
Instrumental motivation 12
Achievement motivation general learning 4
Achievement motivation language learning 11
Achievement motivation general testing 7
Achievement motivation language testing 6
C. Effort Questionnaire 11
D. Anxiety Questionnaire 30
Class anxiety 9
Language anxiety 9
Test anxiety 12
SUB-TOTAL 139
A. Cognitive Strategies 34
Clarifying/verifying 2
Inferencing 2
Summarising 2
Analysing inductively 3
Associating 4
Linking with prior knowledge 4
Repeating/rehearsing 5
Applying rules 3
Practising naturalistically 5
Transferring from L1 to L2 4
B. Metacognitive Strategies 21
Assessing the situation (planning) 6
Monitoring 4
Evaluating 5
Self-testing 6
SUB-TOTAL 55
TOTAL 194
Research has also shown that success can be attributed to the learner's perception of
the task as being easy, while failure stems from the perception that the task
demands appear unreasonable (McCombs 1991).
Based on this research, the Attitudes Questionnaire in the current study
included four scales: attitudes toward speakers of English, attitudes towards
learning a foreign language, and interest in learning a foreign language.
Finally, because we felt that test takers' perception of the language as being
difficult to learn might be an important factor, we included perception of task
difficulty as the fourth scale.
The design of the Motivation Questionnaire was also rooted in Gardner's
AMTB. In this case, we included instrumental and integrative motivation as
scales in the current instrument. We were equally influenced by Weiner's
(1979) notion of effort as having motivational consequences resulting in
success. According to Weiner (1979), success was a result of hard work, while
failure was due to a lack of effort. Consequently, those who believe they have
some degree of control over their success seem to exert more effort in pursuit
of their goals. As achievement motivation and effort seemed to be potentially
important factors in language learning, we included these scales in the
questionnaires. Achievement motivation refers to beliefs and opinions about
one's ability to achieve, while effort refers to the concrete actions a learner
is willing to take in order to achieve.
The final socio-psychological factor questionnaire was designed to measure
anxiety, a condition which may undermine language learning or test
performance. The AMTB defined anxiety in terms of the language class.
However, FCE candidates may also experience anxiety associated with using
the language in real-world communicative situations, where fears may surface
as a result of lack of adequate linguistic control or lack of familiarity with the
norms and expectations of the target culture. Also, the FCE candidates may
experience anxiety related to taking language tests. As a result, the Anxiety
Questionnaire sought to measure three types of anxiety: language class
anxiety, language anxiety and test anxiety, as seen in Table 1.
The development of the strategic factors questionnaire battery was rooted
in Gagné, Yekovich and Yekovich's (1993) model of human information
processing and was influenced by several second-language strategy
researchers. As the development of the cognitive and metacognitive strategy
questionnaires in the LLQs is well documented in Purpura (1999), I will not
duplicate that discussion here. Briefly, the Cognitive Strategy Questionnaire
was designed to measure a number of comprehending, memory and retrieval
strategies, while the Metacognitive Strategy Questionnaire aimed to measure
a set of appraisal strategies such as monitoring and evaluating. For reasons of
space, I will also not discuss the communication strategies questionnaire.
Once the questionnaires outlined in Table 1 were developed, they were
piloted with students around the world (see Bachman, Cushing and Purpura
Study participants
The Attitudes and Anxiety Questionnaires were administered by the EFL
Division of UCLES to 207 ESL students studying on summer courses in
centres around Britain. In common with the Cambridge FCE candidature,
Statistical procedures
Descriptive statistics for each questionnaire item were computed and
assumptions regarding normality were examined. Items that did not meet these
assumptions were considered for removal. Items whose means and medians
were within one point of the outer bounds of the scales, or whose standard
deviation was lower than 1.0, were also considered for removal.
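A sketch of such a screening pass is given below, reading the bounds rule as a ceiling/floor check; the 6-point Likert coding and the pandas-based workflow are assumptions, not details taken from the study.

```python
# A sketch of the item-screening rules, read as a ceiling/floor check on
# 6-point Likert items coded 1-6 (the coding and the interpretation of
# the bounds rule are assumptions).
import pandas as pd

SCALE_MIN, SCALE_MAX = 1, 6

def flag_items(responses: pd.DataFrame) -> list[str]:
    """Return names of items considered for removal."""
    flagged = []
    for item in responses.columns:
        col = responses[item].dropna()
        near_floor = min(col.mean(), col.median()) <= SCALE_MIN + 1
        near_ceiling = max(col.mean(), col.median()) >= SCALE_MAX - 1
        low_spread = col.std() < 1.0
        if near_floor or near_ceiling or low_spread:
            flagged.append(item)
    return flagged
```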
The data were then submitted to a series of reliability analyses. Internal
consistency reliability estimates were computed for the individual
questionnaire scales. Items with low item-total correlations were dropped or
moved to other scales.
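The reliability step could be sketched as follows; the alpha formula is the standard one, while the data layout (one column per item) is an assumption.

```python
# A sketch of the reliability step: Cronbach's alpha for one scale and
# corrected item-total correlations (each item against the sum of the
# remaining items).
import pandas as pd

def cronbach_alpha(scale: pd.DataFrame) -> float:
    k = scale.shape[1]
    item_variances = scale.var(axis=0, ddof=1).sum()
    total_variance = scale.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances / total_variance)

def item_total_correlations(scale: pd.DataFrame) -> pd.Series:
    return pd.Series({
        item: scale[item].corr(scale.drop(columns=item).sum(axis=1))
        for item in scale.columns
    })
```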
Then, each questionnaire was submitted to a series of exploratory factor
analyses (EFA) so that patterns in the observed questionnaire data could be
examined, and latent factors identified. Items that loaded on more than one
factor were flagged for removal, and items that loaded with different sub-
scales from those intended were considered for reassignment.
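A rough equivalent of this screening step is sketched below, with scikit-learn standing in for the statistics package actually used (an assumption, as is the 0.30 cross-loading threshold).

```python
# A sketch of the EFA screening step: flag items whose loadings exceed a
# threshold on two or more factors. Tooling and threshold are assumptions.
import numpy as np
import pandas as pd
from sklearn.decomposition import FactorAnalysis

def cross_loading_items(responses: pd.DataFrame, n_factors: int,
                        threshold: float = 0.30) -> list[str]:
    fa = FactorAnalysis(n_components=n_factors, rotation="varimax")
    fa.fit(responses.to_numpy())
    loadings = fa.components_.T  # shape: (n_items, n_factors)
    heavy = (np.abs(loadings) >= threshold).sum(axis=1)
    return [item for item, n in zip(responses.columns, heavy) if n > 1]
```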
Although EFA is a useful statistical procedure for questionnaire validation,
it does not have the power of confirmatory factor analysis (CFA). First, EFA
procedures assume no a priori patterns in the data, thereby imposing no a priori
constraints on the underlying constructs. In my opinion, this process is out of
step with the way the current questionnaires were designed. These instruments
were hardly a random assembly of related items. Rather, they were
constructed in accordance with a number of principles and studied all along
the way. By design, the items were intended to measure one scale and not
another. In short, a number of a priori constraints were, in fact, imposed on
questionnaires by virtue of the design process. As CFA seeks to determine the
extent to which items designed to measure a particular factor actually do so, I
felt a confirmatory approach to validation was more appropriate.
Secondly, EFA, as a statistical procedure, is unable to tease apart
measurement error from the observed variables, and is unable to determine
Findings
Descriptive statistics
Table 2 presents the summary descriptive statistics for the items in the current
study. Most items were within the accepted limits in terms of central tendency,
variation and normality. During the course of the analyses, some of the items
seen below were dropped and some scales (as indicated below) were merged.
Internal consistency reliability estimates for each scale were all in the
acceptable range. They are also presented in Table 2.
[Table 2: summary descriptive statistics and internal consistency reliability estimates for each scale (the reported alphas include .800 and .852).]
[Figure 1: Single-factor measurement model for Attitudes toward English Speakers, with standardised loadings for the eleven AES items ranging from .43 to .77.]
[Figure 2: Simultaneous measurement model for the Attitudes Questionnaire (Attitudes toward English Speakers, Attitudes toward Learning English, and Perception of Task Difficulty). Fit statistics: Chi-square = 314.6, df = 228, BBNFI = 0.82, BBNNFI = 0.94, CFI = 0.94, RMSEA = 0.04.]
Table 3: Attitudes toward English Speakers items, rank-ordered from strongest to weakest indicator (standardised factor loadings)

AES26   The people who speak this language are friendly.   .772   .827
AES40   The people who speak this language are fun to be with.   .682   .715
AES36   The people who speak this language are welcoming.   .666   .754
AES12   The people who speak this language make good friends.   .642   .672
AES37   The people who speak this language are interesting to talk to.   .639   .669
AES16   The people who speak this language are warm-hearted.   .566   .610
AES42   The people who speak this language are open to people from other cultures.   .541   .594
AES21R  The people who speak this language are boring.   .505   .505
AES23   The more I get to know the people who speak this language, the more I like them.   .468   .468
AES7R   It is difficult to have close friendships with people who speak this language.   .434   .434
AES4    People who speak this language are honest.   .430   .430
language class anxiety (LCanx), measured by six items, and test anxiety
(Tanx), measured by six items. These two components of anxiety were
hypothesised to be correlated, but again no correlated errors were postulated.
This model produced a Chi-square of 89.89 with 53 degrees of freedom, a
CFI of .954, and a RMSEA of .059, indicating that the model provided a good
fit for the data. All parameters were statistically significant and
substantively viable. The parameter estimates were sufficiently high. This
model produced a moderate, statistically significant (at the .05 level)
correlation (.52) between language class anxiety and test anxiety, indicating
that some test takers who felt anxious in their language classes also tended to
feel nervous about language tests. These results are presented in Figure 3.
[Figure 3: Two-factor measurement model for the Anxiety Questionnaire (test anxiety and language class anxiety). Fit statistics: Chi-square = 89.89, df = 53, BBNFI = 0.90, BBNNFI = 0.94, CFI = 0.954, RMSEA = 0.059.]
Based on these results, we were again able to use the parameter estimates
to rank-order the items in each measurement model from strongest to weakest
indicator of the underlying factors. This again allowed us to provide a long and
a short version of the questionnaires.
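As an illustration, selecting the strongest indicators is a small operation once the loadings are rank-ordered; the sketch below uses the AES loadings from Table 3 (the cut-off of five items is illustrative).

```python
# A sketch of how rank-ordered loadings can drive a short form: keep the
# k strongest indicators of a factor. The value of k is an assumption.
def short_form(loadings: dict[str, float], k: int = 5) -> list[str]:
    return sorted(loadings, key=loadings.get, reverse=True)[:k]

# The AES loadings reported in Table 3:
aes = {"AES26": .772, "AES40": .682, "AES36": .666, "AES12": .642,
       "AES37": .639, "AES16": .566, "AES42": .541, "AES21R": .505,
       "AES23": .468, "AES7R": .434, "AES4": .430}
print(short_form(aes))  # ['AES26', 'AES40', 'AES36', 'AES12', 'AES37']
```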
These analyses produced the revised taxonomy of socio-psychological
factors shown in Table 4.
Table 4
Revised taxonomy of the socio-psychological factor questionnaires (number of items per scale)
A. Attitudes Questionnaire 24
Attitudes towards English Speakers 11
Attitudes towards Learning English
(merged with Interest in Foreign Languages) 6
Perception of Task Difficulty 7
B. Anxiety Questionnaire 13
Language Class Anxiety (Language Anxiety
was merged with Class Anxiety) 7
Test Anxiety 6
Conclusion
The purpose of this study was to describe the process used to validate a bank
of language learning questionnaires designed to measure selected personal
attributes of the Cambridge candidature. Although these procedures were
performed on all the questionnaires in the battery, this study reported only on
analyses performed on the attitudes and anxiety questionnaires.
The factorial structure of each questionnaire component was modelled
separately by means of item-level structural equation modelling. Items that
performed poorly were removed. Then, all the components of the
questionnaire were modelled simultaneously. With both the attitudes and
the anxiety questionnaires, two components appeared to be measuring
the same underlying construct. Consequently these components were
merged, providing a more parsimonious model of the underlying constructs.
Once there was evidence that these models fitted the data well and were
substantively viable, the results were used to provide a rank-ordering of
the factor loadings associated with each item so that long and short versions
of the questionnaires could be produced and subsequently delivered over
the computer. In this case, the questionnaires could be customised to provide
the strongest indicators of each factor.
These analyses also provided information on the relationship between the
underlying factors in the questionnaire. This proved invaluable in fine-tuning
the instruments as it allowed us to merge scales when statistically and
substantively justifiable. In the attitudes questionnaire, the results showed
some correlation between the learners' attitudes towards learning English and
their attitudes towards English speakers, while an inverse relationship was
observed between the students' perception of how difficult the language was
to learn and their attitudes towards learning English. Then in the Anxiety
Questionnaire a moderate relationship was observed between test takers who
felt anxious speaking English in class and those who felt anxious taking
language tests.
In sum, these results supplied invaluable information on the underlying
structure of the questionnaires. Also, by modelling the different components
of the questionnaires simultaneously, they provided a better understanding of
how the respective components of the questionnaire interacted with one
another, providing substantive insights regarding the test takers' personal
attributes. In this respect, item-level SEM proved invaluable as an analytical
tool for questionnaire validation.
Acknowledgments
Earlier versions of this paper were presented at the 2001 Language Testing
Research Colloquium in Saint Louis and at the 2001 ALTE Conference in
Barcelona. I would like to thank Nick Saville from UCLES for discussing at
these venues how the Language Learning Questionnaires fit into UCLES'
validation plans and how they have been computerised. I am also very
grateful to Mike Milanovic and Nick Saville at UCLES for their continued
support and encouragement over the years in pursuing this project. Finally, I
would like to thank Lyle Bachman and Sara Cushing Weigle for their
expertise and inspiration in developing the original version of these
questionnaires.
References
Alderman, D. and P. W. Holland. 1981. Item performance across native
language groups on the Test of English as a Foreign Language. Princeton:
Educational Testing Service.
Alderson, J. C. and A. H. Urquhart. 1985. The effect of students' academic
discipline on their performance on ESP reading tests. Language Testing 2:
192–204.
Allan, A. 1995. Begging the questionnaire: Instrument effect on readers'
responses to a self-report checklist. Language Testing 12: 2, 133–156.
Anderson, N. J., L. Bachman, K. Perkins and A. Cohen. 1991. An exploratory
study into the construct validity of a reading comprehension test:
Triangulation of data sources. Language Testing 8: 41–66.
Atkinson, L. 1988. The measurement-statistics controversy: Factor analysis
and subinterval data. Bulletin of the Psychonomic Society 26: 4, 361–364.
Bachman, L. F. 1990. Fundamental Considerations in Language Testing.
Oxford: Oxford University Press.
6 Legibility and the rating of second-language writing
Annie Brown
University of Melbourne
Introduction
Just as the advent of new technologies, particularly the computer, has had a
major impact on the delivery of language programs, with an upsurge in
distance-learning programs and independent-learning (CD Rom-based)
programs (e.g. Commonwealth of Australia 1999 and 2001), so too is its
impact beginning to be seen in increasing use of computers for the delivery of
tests. Although computers were first used in testing because they allowed for
the application of IRT in computer-adaptive tests (e.g. Weiss 1990; Chalhoub-
Deville and Deville 1999), more recently they have been used for the delivery
of non-adaptive tests also. In the European context in particular, one major
European project, DIALANG (e.g. Alderson 2001), aims to deliver a battery
of tests via computer for a number of European languages.
Given the widespread interest in computer-based or web-delivered testing
(see Roever 2001), it is particularly important to investigate the impact of the
technology on test performance. In the context of a move to computer-delivery
of TOEFL, Kirsch et al. (1998), for example, investigated the effect of
familiarity with computers on test takers performances. This paper is
concerned with another aspect of construct-irrelevant variance, namely the
relationship of scores awarded on second-language writing tests to essays that
have been handwritten vis-à-vis those that have been word-processed.
Handwriting and neatness of presentation have long been seen as
contaminating factors in the assessment of writing ability, and the impact of
handwriting on overall judgements of writing quality has been the focus of a
number of studies in the area of first-language writing assessment. Some of
these studies involved correlations of teacher-assigned ratings of writing
quality with independent judgements of handwriting (e.g. Stewart and Grobe
1979; Chou, Kirkland and Smith 1982), whereas others involved experimental
designs where the same essays are presented to raters in different presentation
formats involving good handwriting, poor handwriting and, in some cases,
typed scripts (Chase 1968; Marshall and Powers 1969; Briggs 1970; Sloan and
McGinnis 1978; Bull and Stevens 1979; McGuire 1996). The findings indicate
in general that the quality of handwriting does have an impact on scores, and
that increased legibility results in higher ratings; in all the studies except that
by McGuire (1996), the essays with better handwriting or the typed scripts
received higher scores.
Given the great interest over the years in handwriting and its impact on
assessments of writing proficiency within the field of first-language literacy,
it is surprising that there are hardly any studies of the effect of handwriting in
the assessment of second-language writing. One study involving essays
written by non-native speakers (Robinson 1985) produced similar findings to
the majority of the first-language writing studies; essays written by students
whose L1 did not use the Roman alphabet tended to receive lower scores than
essays written by expert writers.
The lack of research into the impact of handwriting on assessments of L2
writing proficiency is all the more surprising in a field where reliability and
validity issues are generally well understood, and where much attention is paid
in the research literature to identifying and examining the impact of construct-
irrelevant variance on test scores. One could argue that it is particularly
important in formal L2 writing test contexts to examine and evaluate the
impact of extraneous variables such as handwriting and presentation, because
it is often on the basis of such tests that decisions regarding candidates future
life or study opportunities are made. Moreover, it is particularly in writing
contexts such as these, where writers typically have to write under
considerable time pressure, that it may be most difficult for them to control the
quality of handwriting and general neatness of layout. It is rare, for example,
in formal tests that writers have time to transcribe a draft of the essay into a
more legible and well presented script. Also, as Charney (1984) points out, in
a test context the constraints imposed on the rater may result in handwriting
playing a larger part in the assessment than it should. She argues that the
assessment constraints (limited time and multiple assessment focuses) mean
that raters have to read essays rapidly, and this may force them to depend on
'those characteristics [such as handwriting] in the essays which are easy to pick
out but which are irrelevant to true writing ability'.
It is, perhaps, natural to assume that the same situation would hold for
assessments of L2 writing as for L1 writing, that is, that poor handwriting
would have a negative impact upon scores. Such an expectation seems logical:
a paper that looks good and is easy to read is likely to create a better
impression on a rater than one which is messy or difficult to read. Chou et al.
(1982), for example, point out that crossings-out and re-sequencing of pieces
of text may be interpreted as being indicative of a student who is unprepared
for writing and unsure of how to sequence his or her ideas; they seem to
contend that it may not simply be that poor writing is difficult to process (and
therefore assess) but also that raters may make negative inferences about the
Methodology
The salience of legibility as a factor in raters judgements was examined
within a controlled experimental study in which a comparison was made of
scores awarded to scripts which differed only in relation to the variable of
handwriting. The essay data was gathered using the IELTS Task Two essay.
On the basis of previous studies in L1 contexts (see above), it was
hypothesised that scores awarded to the handwritten and typed versions of the
essays would be significantly different, with higher scores being awarded to
the typed versions. In addition it was hypothesised that the score differences
would be greater for those scripts where the handwritten version had
particularly poor legibility.
Forty IELTS scripts were selected at random from administrations held at
one test centre within a one-year period. The scripts were selected from five
different administrations and involved five different sets of essay prompts.
Each of the Task Two essays was retyped. Original features such as
punctuation, spelling errors and paragraph layout were retained, but aspects of
text editing that would be avoided in a word-processed essay, such as
crossings-out, insertions and re-orderings of pieces of text, were tidied up.
Next, in order to produce stable and comparable ratings for the two script
types, that is, handwritten and typed (henceforth H and T), each essay was
rated six times. In order to ensure that ratings awarded to an essay in one script
type did not affect scores awarded to the same essay in the other format, raters
did not mark both versions of the same essay. Rather, each of twelve
accredited IELTS raters involved in the study rated half of the typed scripts
and half of the handwritten scripts, each being from a different candidate.
Although in operational testing it is left to the discretion of raters as to
whether they rate the essays globally or analytically, for the purposes of this
study, in order to investigate whether poor legibility had most impact on one
particular assessment category, the raters were instructed to assess all the
essays analytically. Thus, ratings were awarded to each script for each of the
three Task Two analytic categories: Arguments, Ideas and Evidence,
Communicative Quality, and Vocabulary and Sentence Structure. A final
overall band score was calculated in the normal way, by an averaging and
rounding of the three analytic scores. Raters also took the length of each essay
into account in the usual way.
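As a toy illustration of this calculation (the half-band rounding rule is an assumption about 'the normal way', not a documented IELTS convention):

```python
import math

def overall_band(aie: float, cq: float, vss: float) -> float:
    """Average the three analytic ratings and round to the nearest half band.

    The half-band rounding is an assumed convention, not a documented one.
    """
    mean = (aie + cq + vss) / 3
    return math.floor(mean * 2 + 0.5) / 2  # round half up to nearest 0.5

print(overall_band(5, 6, 6))  # 5.5
```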
In addition to the IELTS ratings, judgements were made of the legibility of
each handwritten script. A six-point scale was developed specifically for the
purposes of this study. Drawing on discussions of legibility in verbal report
studies such as those discussed above, legibility was defined as a broad
concept which included letter and word formation, general layout (spacing,
paragraphing and lineation), and editing and self-correction. The four judges
(all teachers of writing in first- or second-language contexts) were given
written instructions to accompany the scale.
Results
Table 1 shows the mean scores for both the analytic and overall score
categories for each version (H and T) of each essay. It shows that both the
analytic and the overall scores were on average marginally higher for the
handwritten scripts than for the typed scripts. The handwritten scripts
achieved a mean rating of 5.30 as opposed to 5.04 for typed scripts for
Arguments, Ideas and Evidence (AIE), 5.60 as opposed to 5.34 for
Communicative Quality (CQ), 5.51 as opposed to 5.18 for Vocabulary and
Sentence Structure (VSS), and an Overall Band Score (OBS), averaged across
the three categories, of 5.48 as opposed to 5.17. The spread of scores was
similar for both types of script. Although one might expect the score difference to be least marked for Arguments, Ideas and Evidence, as this category is the least concerned with presentation issues, and most marked for Communicative Quality, the differences proved to be broadly similar in size across the three categories.
In order to investigate the significance of the score differences for the two
script types for each rating category, a Wilcoxon matched-pairs signed-ranks
test was carried out (see Table 2). As can be seen, although the difference in
mean scores is relatively small (0.26 for AIE and CQ, 0.33 for VSS, and 0.27
for OBS), it is nonetheless significant for all rating categories.
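For readers wishing to replicate this analysis, the Wilcoxon matched-pairs signed-ranks test is available in standard statistical libraries. A minimal Python sketch follows, using invented paired ratings rather than the study's data.

```python
from scipy.stats import wilcoxon

# Invented paired ratings for the same essays in the two script types;
# these are not the study's data.
handwritten = [5.5, 6.0, 5.0, 4.5, 6.5, 5.0, 5.5, 6.0]
typed       = [5.0, 6.5, 4.5, 4.5, 6.0, 5.0, 5.0, 5.5]

# The test ranks the absolute paired differences (zero differences are
# dropped) and asks whether positive and negative differences balance.
stat, p = wilcoxon(handwritten, typed)
print(f"W = {stat}, p = {p:.3f}")
```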
The second analysis looked more narrowly at the impact of different
degrees of legibility on ratings. On the basis of findings within the L1 writing
assessment literature, it was considered likely that the score differences across
the two script types (H and T) would be insignificant for highly legible scripts
but significant for ones that were difficult to decipher. A comparison was
made of the score differences for the ten essays judged to have the best
legibility and the ten judged to have the worst (see Appendix for examples of
handwriting).
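One way of operationalising this grouping is sketched below in Python. The paper does not state how the four judges' six-point ratings were combined, so taking each script's legibility as the mean of the four ratings is an assumption, and the values shown are invented.

```python
# Four judges' six-point legibility ratings per script (invented values)
legibility = {
    "s01": [2, 1, 2, 2],
    "s02": [5, 6, 5, 5],
    "s03": [3, 4, 3, 3],
    # ... one entry per script ...
}

# Assumed aggregation: simple mean of the four judges' ratings
mean_rating = {sid: sum(r) / len(r) for sid, r in legibility.items()}
ranked = sorted(mean_rating, key=mean_rating.get)

worst_ten = ranked[:10]   # lowest mean legibility
best_ten  = ranked[-10:]  # highest mean legibility
```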
Table 2 shows the average score difference for the two script types for each
set of ten essays. As expected, the score difference between the H and T
versions for the candidates with the best handwriting was found to be
relatively small (ranging from 0.05 to 0.17 of a band), whereas for those with
the worst handwriting, it was somewhat larger (ranging from 0.5 to 0.62, i.e.
at least half a band).
A Wilcoxon matched-pairs signed-ranks test was carried out in order to
determine the significance of the score differences between the two script
types for each group. For the scripts with the best handwriting, none of the
differences were significant. For those with the worst handwriting, AIE was
not significant but the other three categories were: CQ at the .05 level, and
VSS and OBS at the .01 level.
Discussion
In summary, the analysis found that, as hypothesised, there was a small but
significant difference in the scores awarded to typed and handwritten versions
of the same essay. Also as expected, the score difference between handwritten
and typed essays was greater for essays with poor legibility than for those with
good legibility, being on average less than 0.1 of a band for the well written
essays but slightly over half a band for the poorly written ones. However,
contrary to expectations, it was the handwritten scripts that scored more
highly, and the handwritten scripts with poor legibility that had the greatest
score difference between versions. In effect, this means that, rather than being
disadvantaged by bad handwriting and poor presentation, test candidates are
advantaged.
It is interesting to reflect more closely on why the findings here differ from
those found in most studies of first-language writing. As noted earlier, a major
difference in the rating of L1 and L2 writing is that in L2 assessments there is
a stronger emphasis on mechanics or linguistic features (syntax, grammar
and vocabulary) (Cumming 1998). It may be, then, that poor legibility has the
effect of masking or otherwise distracting from these sorts of errors. In formal
L2 proficiency tests, raters usually have limited time to mark each essay. They
also have multiple assessment focuses which demand either multiple readings
or a single reading with attention being paid simultaneously to different
features. Given this, it may be that the extra effort required to decipher
illegible script distracts raters from a greater focus on grammar and accuracy,
so that errors are not noticed or candidates are given the benefit of the doubt
when raters have to decide between two scores. The corollary of this, of
course, is that errors stand out more or are more salient when the essay is typed
or the handwriting is clear.
It may also be, of course, that presentation problems inhibit fluent reading
to the extent that the quality not only of the grammar, but also of the ideas and
their organisation (the coherence of the script), is hard to judge. One rater, for
example, commented that she found she had to read essays with poor legibility
more carefully in order to ensure that she was being fair to the candidate.
Perhaps when raters do not have the time to read very closely (in operational
testing sessions), and the handwriting makes it difficult to read the essay, it
may be that they (consciously or subconsciously) compensate and award the
higher of two ratings where they are in doubt, in order to avoid discriminating
against the candidate. That raters compensate to avoid judging test candidates unfairly appears to be a common finding: they have been found to compensate, for example, for the difficulty of the specific writing task encountered and for perceived inadequacies in the interviewer in tests of speaking.
A third consideration, and one that arose from a later review of the texts
judged as the most legible, concerned the sophistication of the handwriting.
In the context of this study many of the texts had been produced by learners
from non-Roman script backgrounds, and the handwriting in the legible
texts, although neat, was generally simple printing of the type produced by
less mature L1 writers; indeed it was similar to children's writing (see
Appendix). Although there is no evidence of this in what the raters say, it may
be that they unconsciously make interpretations about the sophistication or
maturity of the writer based on their handwriting, which, in the context of a
test designed as a university screening test, might affect the scores awarded.
This question would require further investigation, as the tests in this study
were rated only for neatness, not sophistication.
What this study indicates, other than that poor handwriting does not
necessarily disadvantage learners of English in a test context, is that
alternative presentation modes are not necessarily equivalent. Given moves in
a number of large-scale international tests to ensure that test administrations
can be carried out as widely as possible, alternative delivery or completion
modes are often considered (see, for example, O'Loughlin 1996). This study
shows that, before the operational introduction of alternative modes of testing,
research needs to be undertaken into the score implications of such a move in
order to determine whether the assumed equivalence actually holds.
Finally, this study indicates a need for further research, not only in terms of
replicating the current study (to see whether these findings apply in other
contexts) but also in terms of other variables which might arise when a test is
administered in two modes. This study, like most of the earlier experimental
studies, is concerned with directly equivalent scripts; in investigating the
question of legibility it deals only with the same script in different formats; it
does not deal with the larger question of differences in composing in the two
modes, yet the availability of spell- and grammar-checkers (along with any
other sophisticated aids to composing that word-processing software may
provide) imposes additional variables which are not within the scope of the
present study. It would also be of interest, then, to compare the scores awarded
to essays produced in the two formats, pen-and-paper and word-processed, by
the same candidates.
References
Alderson, J. C. 2001. Learning-centred assessment using information
technology. Symposium conducted at the 23rd Language Testing Research
Colloquium, St Louis, MO, March 2001.
Briggs, D. 1970. The influence of handwriting on assessment. Educational
Research 13: 50–55.
Brown, A. 2000. An investigation of the rating process in the IELTS Speaking
Module. In R. Tulloh (ed.), Research Reports 1999, Vol. 3 (pp. 49–85).
Sydney: ELICOS.
Bull, R. and J. Stevens. 1979. The effects of attractiveness of writer and
penmanship on essay grades. Journal of Occupational Psychology 52: 1,
53–59.
Chalhoub-Deville, M. and C. Deville. 1999. Computer-adaptive testing in
second language contexts. Annual Review of Applied Linguistics 19:
273–299.
Charney, D. 1984. The validity of using holistic scoring to evaluate writing: a
critical overview. Research in the Teaching of English 18: 65–81.
Chase, C. 1968. The impact of some obvious variables on essay test scores.
Journal of Educational Measurement 5: 315–318.
Chou, F. J., S. Kirkland and L. R. Smith. 1982. Variables in college
composition (ERIC Document Reproduction Service No. 224 017).
Commonwealth of Australia. 1999. Bridges To China. A web-based
intermediate level Chinese course. Canberra: Commonwealth of Australia.
Commonwealth of Australia. 2001. Bridges To China (CD-ROM version). A
web-based intermediate level Chinese course. Canberra: Commonwealth of
Australia.
Cumming, A. 1990. Expertise in evaluating second language compositions.
Language Testing 7: 31–51.
Cumming, A., R. Kantor and D. Powers. 1998. An investigation into raters'
decision making, and development of a preliminary analytic framework, for
scoring TOEFL essays and TOEFL 2000 prototype writing tasks.
Princeton, NJ: Educational Testing Service.
Huot, B. 1988. The validity of holistic scoring: a comparison of the talk-aloud
protocols of expert and novice raters. Unpublished dissertation, Indiana
University of Pennsylvania.
Huot, B. 1993. The influence of holistic scoring procedures on reading and
rating student essays. In M. Williamson and B. Huot (eds.), Validating
holistic scoring for writing assessment: Theoretical and empirical
foundations (pp. 206–236). Cresskill, NJ: Hampton Press.
Kirsch, I., J. Jamieson, C. Taylor and D. Eignor. 1998. Computer familiarity
among TOEFL examinees: TOEFL Research Report 59. Princeton, NJ:
Educational Testing Service.
Appendix 1
[Samples of candidates' handwriting, reproduced in the original publication.]
7
Modelling factors affecting oral
language test performance
Barry O'Sullivan
University of Reading
Background
O'Sullivan (1995, 2000a, 2000b, 2002) reported a series of studies designed
to create a body of empirical evidence in support of his model of performance
(see Table 1 for an overview of these studies). The model sees performance on
oral proficiency tests (OPTs) as being affected by a series of variables
associated with the test taker, the test task and the interlocutor. The first three
studies, included in Table 1, focused on particular variables, isolated under
experimental conditions, and found mixed evidence of significant and
systematic effects. An additional study (O'Sullivan 2000b), designed to
explore how the variables might interact in an actual test administration, failed
to provide evidence of any significant interaction.
The study reported here was designed to further explore the effect of
interactions amongst a series of variables on OPT performance (as represented
by the scores awarded). Therefore, the hypothesis tested in this study can be
stated as follows:
In a language test involving paired linguistic performance, there will be a
significant (p < .05) and systematic interaction between the variables
relative age and gender, acquaintanceship, perceived relative personality
and perceived relative language ability.
Method
The test takers
The results for a total of 565 candidates from three major European test
centres (Madrid, Toulouse and Rome) on the Cambridge First Certificate in
English (FCE) were included in this study. The candidates were representative
of the typical FCE population in terms of age, gender, educational background
and test experience, based on data provided from the Candidate Information
Sheets (CIS) completed by all UCLES Main Suite candidates.
Table 1 Overview of studies

Gender (O'Sullivan 2000a)
Participants: 12 Japanese university students (6 men and 6 women, average age approx. 20).
Tasks: structured interview format; Part 1 short answers, Part 2 longer responses; interviewed twice, once by a woman and once by a man.
Analysis: quantitative data (FSI scale), two-way repeated-measures ANOVA; qualitative data, accuracy and complexity (participants' language) and speech characteristics (interviewer language).
Findings: significant difference found, higher when interviewed by a woman (though Grammar was the only criterion to reach significance); significant difference found for accuracy (but not complexity); significant difference found in the language of the interviewers.

Acquaintanceship (O'Sullivan 2002)
Participants: 12 Japanese women (aged 21–22).
Tasks: Phase 1, personal information exchange; narrative based on a set of pictures; decision-making task; all performed once with a friend, once with a stranger.
Analysis: quantitative, Wilcoxon Matched-Pairs Signed-Ranks test on FSI scores.
Findings: significant difference found (higher with friend); actual difference of almost 10% (discounting two outliers).

Multi-Variable Study (O'Sullivan 2000b)
Participants: 304 Turkish students (148 women, 156 men).
Tasks: Task 1, personal information exchange; Task 2, pair work (selection of items for holiday, graphic given); Task 3, negotiation for additional items.
Analysis: quantitative, MANOVA and General Linear Model on FCE scores, using responses to a questionnaire (perception of partner); analysed in two phases.
Findings: Phase 1, significant main effect for Partner Sex and Partner Acquaintanceship (no interaction), higher when partner is male, and a stranger; Phase 2, no significant difference observed for Personality.
The examiners
A total of 41 examiners took part in the test administration over the three
sites. All examiners were asked to complete an information sheet on a
voluntary basis, and all examiners who participated in this administration of
the FCE did so. The data collected indicated an even male/female divide, and
suggested that the typical examiner was an English teacher (most commonly
with a Diploma-level qualification), in the region of 40 years old, and with
extensive experience both as a teacher and of the FCE examination.
to support the argument that the two are either measuring different aspects of oral proficiency or measuring the same ability, although from different perspectives. Both of these figures are significant at the .01 level, and should be considered satisfactory for this type of live test.
Results
Before performing the main analysis, it was first necessary to explore the data
from the Candidate Questionnaires (CQs), in order to establish the final
population for the proposed analysis. Table 2 contains the responses to the
items on the Candidate Questionnaire (CQ). From this table we can see that the population mix is approximately 55% female and 45% male; this mix is reflected in the partner gender figures. Any difference in number is a reflection of the fact that not all test takers in all centres are included in this population, for instance where they have not fully completed the CIS or CQ.
It is clear from this table that the five-level distinction for items does not
yield sufficiently large cell sizes for analyses across all levels. For this reason,
it was decided to collapse the extremes for each variable. The results of this
procedure are included in the table, in the rows entitled 3-levels.
5 Partner Lang. Level   Much lower   Lower   Similar   Higher   Much higher
5-levels                     1          45      400       116         4
3-levels                                46      400       120
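The recoding itself is a one-line mapping. The Python sketch below, with invented responses, shows the collapsing of the two extreme categories into their neighbours as just described.

```python
import pandas as pd

# Invented responses to the partner-language-level item
responses = pd.Series(
    ["Much lower", "Similar", "Higher", "Much higher", "Lower", "Similar"]
)

# Merge each extreme category into its neighbour, as described above
collapse = {
    "Much lower": "Lower", "Lower": "Lower",
    "Similar": "Similar",
    "Higher": "Higher", "Much higher": "Higher",
}
print(responses.map(collapse).value_counts())
```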
ANOVA Results
The initial ANOVA was performed on the Overall Total Score (OVTOT) because this is the reported score and, as such, the most important as far as test outcome is concerned. Through this ANOVA it should be possible to identify any main effects and/or interactions among the six independent variables, thus identifying variables or combinations of variables which appear systematically to affect performance (as represented by the score achieved).
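A factorial ANOVA of this kind might be set up as in the sketch below, using statsmodels; the data file and column names (ovtot, cand_gender, and so on) are hypothetical stand-ins, not the study's actual variables.

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# One row per test taker; file and column names are hypothetical
df = pd.read_csv("fce_speaking.csv")

# Two illustrative three-way interaction terms; a full six-way
# factorial is rarely estimable with realistic cell sizes.
model = ols(
    "ovtot ~ C(cand_gender) * C(partner_gender) * C(acquaintance)"
    " + C(partner_age) * C(partner_personality) * C(partner_level)",
    data=df,
).fit()
print(sm.stats.anova_lm(model, typ=2))
```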
ANOVA Overall Total Score
Table 3 Analysis of variance for overall total score
From this interaction plot we can see that there is a clear difference between
the male and female test takers. While there is a very slight fall in the mean
score of male candidates relative to the perceived language level of their male
interlocutors, they appear to be achieving considerably higher scores when
paired with a woman whose language level they perceive to be lower than
their own (almost 2 points out of a possible 25, approximately 8% of the
range). On the other hand, the men appear to achieve similar scores with
women whom they consider to be at either the same or a higher language level
than themselves. Female test takers, by contrast, seem to display wider,
and more significant, differences under the different conditions. Unlike their
male counterparts, the difference in mean scores achieved when working with
other female test takers at the three different language levels varies by up to
1.5 points (6% of the range). However, it is with male partners that the scores
really show a dramatic degree of variation. Here, there appears to be a
systematic lowering of the mean score as the perceived language level of the
male partner increases. The range of difference is approximately 6 points, or
24% of the possible range.
Candidate Gender * Partner Personality * Acquaintanceship
In the graphic representation of this interaction (see Figure 2 below) we can see that the male and female test takers tend to achieve similar patterns of mean scores when working with an acquaintance, with similar differences in scoring range (approximately one point, or 4% of the overall range). However,
when it comes to the other conditions, it is clear from the graph that there are
very different patterns for the male and female candidates.
The data suggest that there is relatively little difference in mean scores
when the female test takers are working with a partner whom they perceive as
being more outgoing than themselves, irrespective of the degree of
acquaintanceship. The same can be said of partners perceived as being less outgoing than themselves, though there is a clear tendency for these test takers to achieve scores that are approximately 2 points (8%) higher when working with partners perceived as being more outgoing.
It is when working with a partner considered to be a friend, similar in
personality to themselves, that the mean scores achieved by the female test takers are the most variable, with an overall range of mean scores of 5.5 points (22% of the overall score). In order to double-check that the data upon
which this chart is based were reliable, a review of the original data set was
made at this point. Although there was one instance of a very low-scoring test
taker among this group, the overall effect does not change dramatically if that
score is removed.
In contrast, the male test takers seem to achieve similar scores when
working with partners they perceive as less outgoing or similar in personality
to themselves, regardless of the degree of acquaintanceship. Here, the greatest
variation appears to be when the partner is perceived as being more outgoing, with a systematic increase in score of approximately 6 points (24%), from a
low of 16.52 with a friend, to 19.46 with an acquaintance, to a high of 22.72
with a stranger.
Partner Gender * Partner Age * Acquaintanceship
In the final 3-way interaction there is again a clear difference in mean scores
awarded under the different conditions (see Figure 3).
There appears to be a certain systematicity to the scores achieved by the test
takers when working with a male partner. While there is little difference when
the partner is younger than the test taker (irrespective of the degree of
acquaintanceship), there is some considerable difference when the partner is
older. Though this is true of all conditions, the difference in mean score
between working with a younger friend and working with an older friend is
approximately 5 points (20%). We can also say that the clearest distinction
between working with a male stranger, acquaintance or friend comes where
that person is seen to be older.
With the female partners the picture is far more complicated. While there
appears to be little difference in performance with an acquaintance
(irrespective of their age), there appears to be a degree of systematicity in the
rise in mean score when test takers are paired with a female stranger, resulting
in the highest mean scores when they are paired with an older stranger.
Looking across the graph we can see that the same pattern of mean scores can
be found with both male and female partners, though the mean scores for the former are approximately 2 points higher than those for the latter.
The greatest differences between the two sides of the graph are to be found in
the interactions with a friend. Whereas the test takers appeared to gain
systematically higher scores with older male strangers, when they are paired
with a female test taker they appear to achieve the highest mean scores when they consider that person to be younger than themselves.
The finding that these variables (again in particular interactions) seem to have had some effect on the mean scores achieved by these test takers confirms that the model of performance proposed by O'Sullivan (2000b), and reproduced here as Figure 4, may be used as a starting point for continued exploration of the notion of performance.
When consideration was given to the weighting of test-taker scores used by
Cambridge, ANOVA indicated that the same interaction effects were to be
found as were found in the analysis using the unweighted scores, though the
balance of the contribution of the scores from the different examiners was
altered somewhat.
Figure 4 A model of performance (O'Sullivan 2000b): performance is influenced by the characteristics of the test taker, the characteristics of the task and the characteristics of the interlocutor.
Additional conclusions
The results suggest a link between performance on a test (as seen through the eyes of the examiners) and the test takers' perceptions of the person they are paired with in that test. The conclusion must be that test takers' reactions to their partner somehow affect their performances on tasks.
There is a distinction made in the model between the interlocutor and the
task. It can be argued that this distinction is not entirely valid, as the
interlocutor is an aspect of the test performance conditions. The distinction
is made here in order to reflect the important role played by the interlocutor
in any interaction.
The model presented here can be seen as a move towards a more socio-
cognitive view of performance in which the cognitive processing of certain
kinds of information is recognised as being socially driven (see Channouf, Py and Somat 1999 for empirical support for this approach to cognitive processing).
References
Bachman, L. F. and A. S. Palmer. 1996. Language Testing in Practice.
Oxford: OUP.
Berg, E. C. 1999. The effects of trained peer response on ESL students' revision types and writing quality. Journal of Second Language Writing 8: 3, 215–241.
Brown, A. 1998. Interviewer style and candidate performance in the IELTS
oral interview. Paper presented at the Language Testing Research
Colloquium, Monterey, CA.
Channouf, A., J. Py and A. Somat. 1999. Cognitive processing of causal
explanations: a sociocognitive perspective. European Journal of Social
Psychology 29: 673–690.
Horowitz, D. 1986. Process, not product: Less than meets the eye. TESOL
Quarterly 20: 141–144.
Lazaraton, A. 1996a. Interlocutor support in oral proficiency interviews: the
case of CASE. Language Testing 13: 2, 151–172.
Lazaraton, A. 1996b. A qualitative approach to monitoring examiner conduct
in the Cambridge assessment of spoken English (CASE). In M. Milanovic
and N. Saville (eds.) Performance Testing, Cognition and Assessment:
selected papers from the 15th Language Testing Research Colloquium,
Cambridge and Arnhem. Studies in Language Testing 3, pp. 18–33.
Locke, C. 1984. The influence of the interviewer on student performance in
tests of foreign language oral/aural skills. Unpublished MA Project.
University of Reading.
McNamara, T. F. 1997. Interaction in second language performance
assessment: Whose performance? Applied Linguistics 18: 4, 446–466.
Milanovic, M. and N. Saville. 1996. Introduction. Performance Testing,
Cognition and Assessment. Studies in Language Testing 3. Cambridge:
UCLES, 1–17.
O'Sullivan, B. 1995. Oral language testing: does the age of the interlocutor make a difference? Unpublished MA dissertation. University of Reading.
O'Sullivan, B. 2000a. Exploring gender and oral proficiency interview performance. System 28: 3, 373–386.
O'Sullivan, B. 2000b. Towards a model of performance in oral language testing. Unpublished PhD dissertation. University of Reading.
O'Sullivan, B. 2002. Learner acquaintanceship and OPI pair-task performance. Language Testing 19: 3, 277–295.
Porter, D. and B. O'Sullivan. 1999. The effect of audience age on measured written performance. System 27: 65–77.
Porter, D. 1991a. Affective factors in language testing. In J. C. Alderson and B. North (eds.), Language Testing in the 1990s. London: Macmillan (Modern English Publications in association with The British Council), 32–40.
Porter, D. 1991b. Affective factors in the assessment of oral interaction: gender and status. In S. Anivan (ed.), Current Developments in Language Testing. Singapore: SEAMEO Regional Language Centre. Anthology Series 25: 92–102.
Porter, D. and Shen Shu Hung. 1991. Gender, status and style in the interview. The Dolphin 21. Aarhus University Press, 117–128.
Swain, M. 1985. Large-scale communicative language testing: A case study.
In Y. P. Lee, A. C. Y. Y. Fok, R. Lord and G. Low (eds.) New Directions
in Language Testing. Oxford: Pergamon, pp. 35–46.
8
Self-assessment in DIALANG:
An account of test development
Sari Luoma
Centre for Applied Language Studies,
University of Jyväskylä, Finland
What is DIALANG?
DIALANG is a European project that is developing a computer-based
diagnostic language assessment system for the Internet. The project has been
going on since 1997, currently with 19 institutional partners, with financial
support from the European Commission Directorate-General for Education
and Culture (SOCRATES programme, LINGUA Action D).
The upcoming DIALANG system has unprecedented linguistic coverage: it
has 14 European languages both as test languages and as support languages,
i.e. languages in which users can see the program instructions. DIALANG is
intended to support language learners, and, in addition to test results, it offers
users a range of feedback and advice for language learning. While it proposes
a certain assessment procedure, the DIALANG system allows users to skip out
of sections and to select which assessment components they want to complete
in one sitting.
The assessment procedure in Version 1 of the DIALANG system begins
with a test selection screen. Here, the user chooses the test he or she wants to
take, including choice of support language, the language to be tested, and the
skill to be tested. There are five skill sections available in DIALANG:
listening, reading, writing (indirect), vocabulary, and structures. Thus, the user
can choose, for instance, a listening test in French with instructions in Greek,
or a vocabulary test in German with instructions in English.
The selected test then begins with a placement test and a self-assessment
section. The combined results from these are used to select an appropriate test
for the user. The self-assessment section contains 18 'I can' statements and, for each of these, users click on either Yes (= I can do this) or No (= I can't do this yet). The development of this section will be discussed in this article.
Once both sections are completed, the system chooses a test of approximately
appropriate difficulty for the learner, and the test begins.
In DIALANG Version 1, the test that the user gets is fixed and 30 items
long. In Version 2, the DIALANG system will be item-level adaptive, which
means that the system decides which items to give to the user based on his or
her earlier answers. If the users get an item right, they get a more difficult
item. If they get it wrong, they get an easier item. When the system is fairly
sure that a reliable assessment of the learner's proficiency has been reached,
the test ends, probably in less than 30 items. Once the tests are fully adaptive,
the placement test at the beginning of the procedure will no longer be required,
but the self-assessment section will remain a part of the DIALANG system
because of the belief that it is beneficial for language learning.
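The paper does not specify DIALANG's item-selection algorithm, so the following Python sketch should be read only as an illustration of the general harder-after-correct, easier-after-wrong logic with a crude stopping rule; all names and parameters are invented.

```python
def run_adaptive_test(answer, item_bank, max_items=30):
    """Toy adaptive loop: a harder item follows a correct answer, an
    easier one follows a wrong answer, and testing stops early once the
    step size is small. The 0.8 decay and the 0.2 stopping threshold
    are invented for illustration."""
    ability, step = 0.0, 1.0
    for _ in range(max_items):
        # pick the unused item whose difficulty is closest to the
        # current ability estimate
        item = min(item_bank, key=lambda d: abs(d - ability))
        item_bank.remove(item)
        if answer(item):
            ability += step   # correct: probe with harder material
        else:
            ability -= step   # wrong: drop back to easier material
        step *= 0.8           # shrink the step as confidence grows
        if step < 0.2:        # crude "assessment is reliable" criterion
            break
    return ability

# Simulated candidate who answers correctly below difficulty 1.0
bank = [d / 4 for d in range(-12, 13)]  # difficulties from -3.0 to +3.0
print(run_adaptive_test(lambda d: d < 1.0, bank))
```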
For the DIALANG user, the test works in much the same way whether it is
Version 1 or Version 2. The user gets test items in the selected skill one by
one, and when the test is complete, the user gets feedback.
The feedback menu in DIALANG has two main headings: Results and Advice. Under Results, the users have access to their test results, including
comparison between their self-assessment and their test-based proficiency
level, their results from the placement test, and an option to go back and
review the items that they answered. They can see whether they got each item
right or wrong, and what the acceptable answers were. Under Advice, the
users can read advice about improving their skills, and explanations about
self-assessment. This may be particularly interesting for users whose test score
did not match their self-assessment. The users may read as much or as little
feedback as they wish, and, once finished, the system asks them whether they
want to choose another skill, another test language, or exit.
As evident from the description above, DIALANG gives self-assessment a
prominent position in the assessment procedure. This operationalises the
belief that ability to self-assess is an important part of knowing and using a
language. Every DIALANG user will be asked to self-assess their skills, and
even if they choose to skip it, they will at least have been offered a chance to
assess what they can do in the language that they are learning. If they do
You are now asked to assess your overall ability in one of the main language skills: reading, writing, speaking or listening. The computer has randomly assigned you the skill: Listening. Please choose the statement below which most closely describes your ability in listening to the language which is being tested. If more than one of the statements is true, pick the best one, i.e. the one nearest to the bottom of the screen. Press Confirm when you have given your response.

I can understand very simple phrases about myself, people I know and things around me when people speak slowly and clearly. [Yes]

I can understand expressions and the most common words about things which are important to me, e.g. very basic personal and family information, shopping, my job. I can get the main point in short, clear, simple messages and announcements. [Yes]

I can understand the main points of clear standard speech on familiar matters connected with work, school, leisure etc. In TV and radio current-affairs programmes or programmes of personal or professional interest, I can understand the main points provided the speech is relatively slow and clear. [Yes]

I can understand longer stretches of speech and lectures and follow complex lines of argument provided the topic is reasonably familiar. I can understand most TV news and current affairs programmes. [Yes]

I can understand spoken language even when it is not clearly structured and when ideas and thoughts are not expressed in an explicit way. I can understand television programmes and films without too much effort. [Yes]

I understand any kind of spoken language, both when I hear it live and in the media. I also understand a native speaker who speaks fast if I have some time to get used to the accent. [Yes]
constituted the data set reported here. Some of the groups compared in the
report below are smaller depending on the criteria for grouping the data; the
group sizes will be reported together with the results.
The background characteristics for the two participating samples, Finnish
and English test takers, were similar in terms of their age range, gender
distribution and educational background, but there was one potentially
significant difference between them, namely self-assessed proficiency. The
self-assessed ability levels of the participants in the Finnish pilots were fairly
normally distributed across the ability range, while the ability distribution of
the participants in the English pilots was slightly negatively skewed, i.e. the
participants tended to assess themselves towards the higher end of the ability
scale. This may have influenced the results reported below.
The self-assessment data can be analysed with respect to two questions that were of interest to the project in terms of test development: the relationship between the two types of self-assessment, i.e. the Overall scale and the 'I can' statements, and the relationship between self-assessed proficiency and test performance.
Below are a number of statements relating to your ability to listen to the tested language. Read each of the statements carefully and click Yes if you CAN do what the statement says and No if you CANNOT do what the statement says. All questions must be answered. When you have finished, press Submit. Please make sure that you interpret each of the statements in relation to listening and not speaking, writing, reading, etc.

I can catch the main points in broadcasts on familiar topics of personal interest when the language is relatively slow and clear. [Yes / No]

I can follow specialised lectures and presentations which use a high degree of colloquialism, regional usage or unfamiliar terminology. [Yes / No]

I can understand questions and instructions and follow short, simple directions. [Yes / No]
[Figure: bar charts showing the distribution of participants across the six Overall self-assessment levels for the two samples (N = 18, 38, 98, 56, 48, 51 and N = 21, 22, 58, 48, 45, 29 for levels 1–6 respectively).]
[Figure: charts plotting scores against the six Overall self-assessment levels for the two samples (N = 18, 37, 98, 56, 48, 51 and N = 15, 18, 43, 35, 37, 27 for levels 1–6 respectively).]
small sample sizes, or it may reflect the fact that self-assessment at the beginning levels is not very accurate. However, it may also indicate that not all learners have the patience to read through the six levels of a self-assessment scale before deciding which level is appropriate for them; some may want to get on with the test and thus simply tick one of the first levels. Certainly, this explanation cannot be rejected out of hand on the basis of the correlations between Overall self-assessment, 'I can' statements, and test performance, as shown in Table 1.
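Correlations of this kind are straightforward to compute. The sketch below uses Spearman's rank correlation, which suits ordinal self-assessment levels, although the statistic actually used for Table 1 is not stated here; the data vectors are invented.

```python
from scipy.stats import spearmanr

# Invented data: overall self-assessed level (1-6), number of 'I can'
# statements endorsed (0-18), and a test score, for eight learners.
overall = [2, 3, 3, 4, 5, 6, 1, 4]
i_can   = [5, 9, 8, 12, 15, 18, 3, 11]
test    = [31, 44, 40, 55, 70, 82, 20, 52]

rho, p = spearmanr(overall, test)
print(f"Overall scale vs. test score: rho = {rho:.2f}, p = {p:.3f}")
rho, p = spearmanr(i_can, test)
print(f"'I can' count vs. test score: rho = {rho:.2f}, p = {p:.3f}")
```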
Conclusion
The development of the self-assessment strand in DIALANG continues, and
with larger data sets it will be possible to investigate differences between
languages and possibly regions of origin, i.e. whether learners from some
regions of the world tend to over- or underestimate their skills, as has been
suggested (e.g. Oscarson 1997). It is important to remember, however, that
even if data are available, the meaning is not necessarily apparent from a mere
inspection of numbers. Additional insights need to be gained from learners
using the system. Scale descriptors, for instance, are meaningful and important
for teachers and applied linguists, but learner interpretations of the descriptors, and of the differences between them, are equally important if we are
to achieve a truly learner-orientated assessment system. Decisions in
DIALANG about whether or not to include Overall self-assessment in the
system must likewise be informed not only by statistical and programming
considerations, important as these are, but also by learner feedback on how
they interpret the scale and whether they find it useful. Moreover, if
statistically significant differences are found between self-assessments of
learners of Spanish and learners of Dutch, for example, we must carefully
search for content interpretations for the differences. This is why our project
members are conducting research into how learners interact with the
DIALANG system (Figueras and Huhta, in progress).
The prominence given to self-assessment in DIALANG is based on the
belief that it is beneficial for learners and that it promotes learner
independence. Since the whole DIALANG system is learner-orientated, it also
fits the system ideologically. However, we do not expect great changes in
learner orientation as a result of being exposed to DIALANG, nor are we
planning any experimental pre-DIALANG, post-DIALANG designs to detect
this. Rather, we expect the effects to be subtle. Being asked to assess one's language ability raises awareness in DIALANG users that such evaluations
can be made, and through this, DIALANG brings its contribution to the array
of concepts that language learners have for evaluating their own proficiency.
The basic DIALANG system complements the self-assessment by comparing
it with test results and providing the users with information about why the two
might differ. Preliminary feedback from our experiment with self-rating
indicates that while learners feel able to assess their own proficiency, they also
need external assessment to form a picture of their skills that they can rely on.
Self-assessment is therefore not the be-all and end-all of all assessment, but we
would like to believe that it is a useful addition to the assessment arsenal that
the modern world offers language learners.
References
AERA 1999. [American Educational Research Association, American
Psychological Association, and National Council on Measurement in
Education, 1999.] Standards for Educational and Psychological Testing.
Washington, D.C.: AERA.
Council of Europe 2001. Common European Framework of Reference for
Languages: Learning, Teaching, Assessment. Cambridge: CUP.
Figueras, N. and A. Huhta (in progress). Investigation into learner reception of
DIALANG.
Huhta, A., S. Luoma, M. Oscarson, K. Sajavaara, S. Takala, and A. Teasdale
(forthcoming). DIALANG: a diagnostic language assessment system for
learners. In J. C. Alderson (ed.) Case studies of the use of the Common
European Framework. Council of Europe.
Luoma, S. and M. Tarnanen. 2001a. Experimenting with self-directed
assessment of writing on computer Part I: self-assessment versus external
assessment. Paper given in a symposium entitled Learning-centred
assessment using information technology at the annual Language Testing
Research Colloquium in St Louis, Missouri, February 20–24, 2001.
Luoma, S. and M. Tarnanen. 2001b. Experimenting with self-directed
assessment of writing on computer Part II: learner reactions and learner
interpretations. Paper given at the annual conference of the American
Association of Applied Linguistics in St Louis, Missouri, February 24–27, 2001.
North, B. 1995. The development of a common framework scale of language
proficiency based on a theory of measurement. PhD thesis, Thames Valley
University, November 1995.
Oscarson, M. 1997. Self-assessment of foreign and second language
proficiency. In Clapham, C. and D. Corson (eds.), Encyclopedia of
Language and Education, Volume 7: Language testing and assessment.
175–187. Dordrecht: Kluwer.
Verhelst, N. D. and F. Kaftandjieva. 1999. A rational method to determine
cutoff scores. Research Report 99-07, Faculty of Educational Science and
Technology, Department of Educational Measurement and Data Analysis,
University of Twente, The Netherlands.
Appendix 1
Shortened DIALANG I can self-assessment section for
listening
CEF level   DIALANG Listening 'I can' statement
A1 I can follow speech which is very slow and carefully articulated, with long
pauses for me to get the meaning.
A1 I can understand questions and instructions and follow short, simple directions.
A2 I can understand enough to manage simple, routine exchanges without too much
effort.
A2 I can generally identify the topic of discussion around me which is conducted
slowly and clearly.
A2 I can understand enough to be able to meet concrete needs in everyday life
provided speech is clear and slow.
A2 I can handle simple business in shops, post offices or banks.
B1 I can generally follow the main points of extended discussion around me,
provided speech is clear and in standard language.
B1 I can follow clear speech in everyday conversation, though in a real-life situation
I will sometimes have to ask for repetition of particular words and phrases.
B1 I can understand straightforward factual information about common everyday or
job-related topics, identifying both general messages and specific details,
provided speech is clear and a generally familiar accent is used.
B1 I can catch the main points in broadcasts on familiar topics and topics of personal
interest when the language is relatively slow and clear.
B2 I can understand standard spoken language, live or broadcast, on both familiar
and unfamiliar topics normally encountered in personal, academic or vocational
life. Only extreme background noise, unclear structure and/or idiomatic usage
causes some problems.
B2 I can follow the essentials of lectures, talks and reports and other forms of
presentation which use complex ideas and language.
B2 I can understand announcements and messages on concrete and abstract topics
spoken in standard language at normal speed.
B2 I can understand most radio documentaries and most other recorded or broadcast audio material delivered in standard language and can identify the speaker's mood, tone, etc.
C1 I can keep up with an animated conversation between native speakers.
C1 I can follow most lectures, discussions and debates with relative ease.
C2 I have no difficulty in understanding any kind of spoken language, whether live
or broadcast, delivered at fast native speed.
C2 I can follow specialised lectures and presentations which use a high degree of
colloquialism, regional usage or unfamiliar terminology.
Section 3
A European View
9
Council of Europe language
policy and the promotion of
plurilingualism
Joseph Sheils
Modern Languages Division/Division des langues vivantes,
DG IV, Council of Europe/Conseil de l'Europe, Strasbourg
Introduction
I am grateful to ALTE for taking the initiative in organising this significant
event in the European Year of Languages calendar and for the opportunity to
present some aspects of the Council of Europe's work in promoting
plurilingualism in the context of the Year. The aims of the Year, which is
jointly organised by the Council of Europe and the European Union, can be
summarised as: to encourage more people to learn (more) languages and to
raise awareness of the importance of protecting and promoting the rich
linguistic diversity of Europe. The main messages of the Year are captured in
two slogans: 'Languages open doors: to opportunities, social inclusion, tolerance of differences'; and 'Europe, a wealth of languages': the more than 200 indigenous languages in the 43 member states of the Council of Europe and the languages of migrant communities are equally valid as modes of
expression for those who use them.
In a Europe which is and will remain multilingual, policy responses to this
reality lie between two ends of a continuum. There is on the one hand policy
for the reduction of diversity, and on the other for the promotion and
maintenance of diversity; both can be pursued in the name of improved
potential for international mobility, improved communication and economic
development. The Council of Europe has always pursued the important aim of
maintaining the European cultural heritage, of which linguistic diversity is a
significant constituent, and for which linguistic diversity provides the vital
conditions. It does so through legal instruments such as the European Charter
for Regional or Minority Languages, which continues to be ratified by an
increasing number of states, and through its programmes of intergovernmental
co-operation involving the 48 states party to the European Cultural Convention.
1 Key documents and information are available on the Council of Europe Portfolio website:
http://culture.coe.int/portfolio. The text of the Common European Framework of Reference
is also available on this site. Further information can be obtained from: Modern Languages
Division, DG IV, Council of Europe, 67075 Strasbourg, France.
The Dossier offers the learner the opportunity to select relevant and
up-to-date materials to document and illustrate achievements or
experiences recorded in the Language Biography or Language
Passport.
The purpose of the ELP is not to replace the certificates and diplomas that
are awarded on the basis of formal examinations, but to supplement them
by presenting additional information about the owner's experience and
concrete evidence of his or her foreign language achievements. Clearly, the
importance of the ELP's reporting function will vary according to the age
of the owner. It will usually be much less important for learners in the
earlier stages of schooling than for those approaching the end of formal
education or already in employment. For this reason the Council of Europe
has introduced a standard Language Passport for adults only. It is
particularly important to adult learners that the ELP should be accepted
internationally, and this is more likely to happen if the first of its
components is the same everywhere.
Pedagogical. The ELP is also intended to be used as a means of making the
language-learning process more transparent to learners, helping them to
develop their capacity for reflection and self-assessment, and thus enabling
them gradually to assume increasing responsibility for their own learning.
This function coincides with the Council of Europe's interest in fostering
the development of learner autonomy and promoting lifelong learning.
These two functions are closely linked in practice as illustrated in the ELP
Guide for Teachers and Teacher Trainers.
For example, with younger learners the teacher might start with the Dossier,
in which learners are encouraged to keep examples of their best work. At a
somewhat later stage the Biography is introduced and learners are helped to
set their own learning targets and to review their learning progress. At a still
later stage the Passport is introduced so that learners can take stock of their
developing linguistic identity using the self-assessment grid from the
Common European Framework.
The process can be reversed and this may suit older learners. The Language
Passport might be introduced at the beginning as a way of challenging learners
to reflect on their linguistic identity and the degree of proficiency they have
already achieved in their target language or languages. They then proceed to
the Biography and set individual learning targets. Learning outcomes are
collected in the Dossier and evaluated in the Biography, and this provides the
basis for setting new goals. The process is repeated until the end of the course,
when the learners return to the Passport and update their self-assessment.
Clearly the emphasis placed on the different functions of the Portfolio may
vary depending on the nature and length of courses, but both functions are
important and all three parts must be included in any European Language
Portfolio if it is to be accredited by the Council of Europe. Draft models can
be submitted to the European Validation Committee for accreditation. Details
can be obtained from the Modern Languages Division in Strasbourg and all
relevant documents, including Rules for Accreditation, are available on the Portfolio website (see footnote 1 above).
Appendix 1
The Common European Framework self-assessment grid

UNDERSTANDING

Listening
A1: I can understand familiar words and very basic phrases concerning myself, my family and immediate concrete surroundings when people speak slowly and clearly.
A2: I can understand phrases and the highest frequency vocabulary related to areas of most immediate personal relevance (e.g. very basic personal and family information, shopping, local area, employment). I can catch the main point in short, clear, simple messages and announcements.
B1: I can understand the main points of clear standard speech on familiar matters regularly encountered in work, school, leisure, etc. I can understand the main point of many radio or TV programmes on current affairs or topics of personal or professional interest when the delivery is relatively slow and clear.
B2: I can understand extended speech and lectures and follow even complex lines of argument provided the topic is reasonably familiar. I can understand most TV news and current affairs programmes. I can understand the majority of films in standard dialect.
C1: I can understand extended speech even when it is not clearly structured and when relationships are only implied and not signalled explicitly. I can understand television programmes and films without too much effort.
C2: I have no difficulty in understanding any kind of spoken language, whether live or broadcast, even when delivered at fast native speed, provided I have some time to get familiar with the accent.

Reading
A1: I can understand familiar names, words and very simple sentences, for example on notices and posters or in catalogues.
A2: I can read very short, simple texts. I can find specific, predictable information in simple everyday material such as advertisements, prospectuses, menus and timetables and I can understand short simple personal letters.
B1: I can understand texts that consist mainly of high frequency everyday or job-related language. I can understand the description of events, feelings and wishes in personal letters.
B2: I can read articles and reports concerned with contemporary problems in which the writers adopt particular attitudes or viewpoints. I can understand contemporary literary prose.
C1: I can understand long and complex factual and literary texts, appreciating distinctions of style. I can understand specialised articles and longer technical instructions, even when they do not relate to my field.
C2: I can read with ease virtually all forms of the written language, including abstract, structurally or linguistically complex texts such as manuals, specialised articles and literary works.

SPEAKING

Spoken Interaction
A1: I can interact in a simple way provided the other person is prepared to repeat or rephrase things at a slower rate of speech and help me formulate what I'm trying to say. I can ask and answer simple questions in areas of immediate need or on very familiar topics.
A2: I can communicate in simple and routine tasks requiring a simple and direct exchange of information on familiar topics and activities. I can handle very short social exchanges, even though I can't usually understand enough to keep the conversation going myself.
B1: I can deal with most situations likely to arise whilst travelling in an area where the language is spoken. I can enter unprepared into conversation on topics that are familiar, of personal interest or pertinent to everyday life (e.g. family, hobbies, work, travel and current events).
B2: I can interact with a degree of fluency and spontaneity that makes regular interaction with native speakers quite possible. I can take an active part in discussion in familiar contexts, accounting for and sustaining my views.
C1: I can express myself fluently and spontaneously without much obvious searching for expressions. I can use language flexibly and effectively for social and professional purposes. I can formulate ideas and opinions with precision and relate my contribution skilfully to those of other speakers.
C2: I can take part effortlessly in any conversation or discussion and have a good familiarity with idiomatic expressions and colloquialisms. I can express myself fluently and convey finer shades of meaning precisely. If I do have a problem I can backtrack and restructure around the difficulty so smoothly that other people are hardly aware of it.

Spoken Production
A1: I can use simple phrases and sentences to describe where I live and people I know.
A2: I can use a series of phrases and sentences to describe in simple terms my family and other people, living conditions, my educational background and my present or most recent job.
B1: I can connect phrases in a simple way in order to describe experiences and events, my dreams, hopes and ambitions. I can briefly give reasons and explanations for opinions and plans. I can narrate a story or relate the plot of a book or film and describe my reactions.
B2: I can present clear, detailed descriptions on a wide range of subjects related to my field of interest. I can explain a viewpoint on a topical issue, giving the advantages and disadvantages of various options.
C1: I can present clear, detailed descriptions of complex subjects integrating sub-themes, developing particular points and rounding off with an appropriate conclusion.
C2: I can present a clear, smoothly flowing description or argument in a style appropriate to the context and with an effective logical structure that helps the recipient to notice and remember significant points.

WRITING

Writing
A1: I can write a short, simple postcard, for example sending holiday greetings. I can fill in forms with personal details, for example entering my name, nationality and address on a hotel registration form.
A2: I can write short, simple notes and messages. I can write a very simple personal letter, for example thanking someone for something.
B1: I can write simple connected text on topics which are familiar or of personal interest. I can write personal letters describing experiences and impressions.
B2: I can write clear, detailed text on a wide range of subjects related to my interests. I can write an essay or report, passing on information or giving reasons in support of or against a particular point of view. I can write letters highlighting the personal significance of events and experiences.
C1: I can express myself in clear, well-structured text, expressing points of view at some length. I can write about complex subjects in a letter, an essay or a report, underlining what I consider to be the salient issues. I can select a style appropriate to the reader in mind.
C2: I can write clear, smoothly flowing text in an appropriate style. I can write complex letters, reports or articles that present a case with an effective logical structure which helps the recipient to notice and remember significant points. I can write summaries and reviews of professional or literary works.
References
Beacco, J.-C. and M. Byram. (unpublished draft). Guide for the Development of Language Education Policies in Europe. (Provisional title; publication of draft 1 in spring 2002.)
Council of Europe. 2001. Common European Framework of Reference for Languages: Learning, Teaching, Assessment. Cambridge: Cambridge University Press, pp. 4–5 (French edition: Éditions Didier 2001).
Council of Europe. 2000. Education Policies for Democratic Citizenship and Social Cohesion: Challenges and Strategies for Europe. Adopted texts. Cracow (Poland), 15–17 October 2000, 21–24.
Council of Europe Education Committee. 2000. European Language Portfolio: Principles and Guidelines. Document DGIV/EDU/LANG (2000) 33 (see also the Portfolio website).
Little, D. and R. Perclová. 2001. European Language Portfolio: Guide for Teachers and Teacher Trainers. Strasbourg: Council of Europe.
Schneider, G. and P. Lenz. 2001. Guide for Developers of a European Language Portfolio. Strasbourg: Council of Europe.
Trim, J. 2002. Understanding the Common European Framework of Reference for Languages: Learning, Teaching, Assessment (CEF). Paper given at the 25th TESOL Spain Convention, Madrid.
10
Higher education and language policy in the European Union
Wolfgang Mackiewicz
Conseil Européen pour les Langues / European Language Council
Introduction
In this paper, I shall be dealing with EU language and language education policy and with policy development undertaken by the Conseil Européen pour les Langues/European Language Council and the Thematic Network in
the Area of Languages in response to linguistic challenges posed by European
integration. I shall conclude by reflecting on measures that universities need
to take if they are to live up to their new role as institutions belonging to a
European area of higher education. In doing so, I shall have occasion to refer
to the draft Berlin Declaration issued by the members of the Scientific
Committee of the Berlin European Year of Languages 2001 Conference held
at the Freie Universität Berlin on 28–30 June 2001. In addition, I shall
repeatedly refer to the contributions made by the Council of Europe to the
promotion of multilingualism in Europe.
In discussing EU language policy, I shall be considering language policy at
Community level, not language policy at member state or regional level.
However, during the course of the paper, I shall refer to what I regard as the
responsibility and duties of the member states and the regions in regard to
language education policy.
EU language policy
As regards EU language policy, two things need to be noted above all else.
(i) In education, the powers of the EU are limited. This is evidenced by the fact
that Article 149 of the Treaty Establishing the European Community describes
Community action in the fields of education, vocational training and youth in
terms of contributing to, supporting and supplementing action taken by the
member states. The Council of the European Union can (jointly with the
Linguistic diversity
The fundamental principle underlying all of the EU's statements on matters
regarding language policy is that of linguistic and cultural diversity. Unlike
the United States of America, the European Union is being constructed as a
multilingual society. The eleven national languages of the member states are
regarded as being equal in value. (In addition, Irish enjoys a special status;
Union, released last year, consists of one sentence: "The Union shall respect cultural, religious and linguistic diversity." There is a section on "Regional and minority languages of the European Union" on the Commission's Education website, which, among other things, lists projects that have received
Community support. More recently, there have been indications of a
movement towards including regional and minority languages in EU language
policy and European Union programmes. The Opinion of the Committee of
the Regions of 13 June 2001 on the Promotion and Protection of Regional and
Minority Languages certainly points in this direction.
Needless to say, the universities' teaching, provision, development and
research in the area of languages cannot solely or even primarily be guided by
the political considerations of the European Union. I am convinced, however,
that the promotion of societal and individual multilingualism as reflected by
the language policy and language education policy propagated and promoted
by the EU is of crucial importance for the future of the Union and that
education in general, and higher education in particular, have responsibilities
and duties in this respect; higher education in particular, because of the
universities' wide-ranging language-related activities in teaching, provision,
development and research, which include the following:
– modern language degree programmes, area studies programmes, and programmes combining language study with the study of other disciplines
– teacher education
– the training of translators and interpreters
– the delivery of courses or of portions of courses in one or more than one other language
– language provision for non-specialists, including linguistic preparation and support for mobility
– provision for university staff and for people from the non-university environment
– development of materials for the above types of programme and for language learning and assessment in general
– research related to the issues of linguistic diversity and of language learning, teaching, mediation, and assessment.
What should become clear from this list is that language study is not just
relevant to language and language-related degree programmes, but potentially
pervades the whole of higher education just as the language issue is relevant
to the whole of society. In fact, I believe it makes sense to view the above
types of activity as being interrelated, constituting the area of languages in
higher education, as it were.
The core of the above typology originates from one of the pilot projects that
were carried out as precursors of the SOCRATES-ERASMUS Thematic
Networks: the SIGMA Scientific Committee on Languages (12/94–10/95).
The members of the Committee were to produce national reports on the status
quo in higher education language studies, to identify new needs and to propose
measures for improvement and innovation. At the very first meeting, the
Committee decided to focus on the transmission of linguistic and cultural
knowledge and skills and on language mediation. This approach reflects one
of the central aims of thematic networks in general: to address the frequently
observed disconnection of higher education programmes from changing needs
in the social, professional, and economic environments. SIGMA led to the
creation of the Conseil Européen pour les Langues/European Language Council in 1997 and to the launch of the first fully fledged Thematic Network Project in the Area of Languages (1996–1999).
The CEL/ELC is a permanent and independent association of European
higher education institutions and associations specialising in languages.
Currently, its membership stands at 170. Its principal aim is quantitative and
qualitative improvement in knowledge of all the languages and cultures of the
EU and of other languages and cultures. It seeks to pursue this aim through
European co-operation. Apart from conducting workshops, organising
conferences, and publishing an information bulletin, the association has so far
been active in policy development and in initiating European curriculum and
materials development projects and co-operation projects, such as thematic
network projects. Among the policy papers prepared was a proposal for higher
education programmes and provision in regional and minority languages and
immigrant languages and, more recently, a comprehensive document on the
development and implementation of university language policies. The latter is intended as a framework for recommendations and actions in education and
research linked to developments in the professional, economic, socio-cultural,
and political domains.
The CEL/ELC co-operates with the European Commission, notably with
the Directorate General for Education and Culture and the translation and
interpreting services, the European Parliament and the Council of Europe as
well as with associations such as Cercles, the European Association for
International Education, and the European University Association. It carried
out a pilot project for the trialling of the Council of Europe's European
Language Portfolio in higher education, and the CEL/ELC and Cercles are
about to develop a common Portfolio for use in higher education.
Conclusion
During and after the Berlin Conference I gave a number of interviews to
Germany-based newspapers and radio stations. Within no time, the interviews
invariably turned to the question of language teaching and learning at school.
According to EUROSTAT, only 53% of all adults in the EU can take part in
a conversation in a language other than their mother tongue. Of course, we all
know how this can be remedied: through early language teaching and through
bilingual or multilingual education. This has become accepted wisdom. The
Bibliography
Charter of fundamental rights of the European Union.
http://db.consilium.eu.int/df/default.asp?lang=en
www.europarl.eu.int/charter/default_en.html
Conseil Européen pour les Langues/European Language Council. Université et politique linguistique en Europe. Document de Référence. Berlin, 2001.
Conseil Européen pour les Langues/European Language Council.
Website. http://sprachlabor.fu-berlin.de/elc
Council of Europe. 2001. Common European Framework of Reference for
Languages: Learning, Teaching, Assessment. Cambridge: Cambridge
University Press.
Council Resolution of 31 March 1995 on improving and diversifying
language learning and teaching within the education systems of the
European Union.
http://www.europa.eu.int/eur-lex/en/lif/dat/1995/en_395Y0812_01.html
Council Resolution of 16 December 1997 on the early teaching of European
Union Languages.
http://www.europa.eu.int/eur-lex/en/lif/dat/1998/en_398Y0103_01.html
Decision No 1934/2000/EC of the European Parliament and of the Council
of 17 July 2000 on the European Year of Languages 2001.
http://www.europa.eu.int/comm/education/languages/actions/decen.pdf
The DIALANG Project website. http://www.dialang.org/
The Draft Berlin Declaration issued by the Scientific Committee of the
Berlin EYL 2001 Conference (Freie Universität Berlin, 28–30 June 2001).
http://sprachlabor.fu-berlin.de/elc/docs/BDeclEN.pdf
Eurobarometer Report 54: Europeans and Languages
http://europa.eu.int/comm/education/baroexe.pdf
European Commission. Lingua website.
http://www.europa.eu.int/comm/education/languages/actions/lingua2.html
European Commission. Memorandum on Higher Education in the European
Community. Communication from the Commission to the Council on 5
November 1991 (COM (91) 349 final).
European Commissions website on regional and minority languages and
cultures. http://europa.eu.int/comm/education/langmin.html
European Commission. 1995. White Paper on Education and Training – Teaching and Learning: Towards the Learning Society.
http://europa.eu.int/comm/education/lb-en.pdf and
http://europa.eu.int/en/record/white/edu9511/
Holdsworth, Paul. The work of the Language Policy Unit of the European
Commission's Directorate-General for Education and Culture. CEL/ELC Information Bulletin 7 (2001): 11–15.
http://sprachlabor.fu-berlin.de/elc/bulletin/7/index.html
Section 4
Work in Progress
11
TestDaF: Theoretical basis and empirical research
Rüdiger Grotjahn
Ruhr-Universität Bochum, Germany
Introduction
TestDaF (Test Deutsch als Fremdsprache – Test of German as a Foreign
Language) is a new standardised language test for foreign students applying
for entry to an institution of higher education in Germany. In this respect it is
comparable to the International English Language Testing System (IELTS)
and the Test of English as a Foreign Language (TOEFL). TestDaF consists of
four sub-tests and takes 3 hours and 10 minutes to complete: 60 minutes for
reading comprehension, 40 minutes for listening comprehension, 60 minutes
for written expression, and 30 minutes for oral expression.
In the press, TestDaF has generally been compared to the TOEFL rather than to the IELTS. Characterising TestDaF as "the German TOEFL" or "TOEFL's little brother" is, however, misleading, since in terms of content and format TestDaF is much closer to the IELTS than to the TOEFL.
One major reason for developing TestDaF was to make Germany more attractive internationally to foreign students who wish to study abroad, by ensuring recognition of applicants' language proficiency by German institutions of higher education while the applicants are still in their home country.
To achieve this aim, the DAAD, the German Academic Exchange Service,
set up a test development consortium1 and financed the development of
TestDaF from August 1998 to December 2000. During this period three
parallel test sets were developed and piloted. In addition, various materials
were developed, such as manuals for item writers and raters. Subsequently, the
TestDaF Institute was set up at the University of Hagen to provide the
necessary infrastructure for centralised test construction, centralised grading
and centralised test evaluation. On April 26 2001, TestDaF's first international
A (Basic User): A1 Breakthrough, A2 Waystage
B (Independent User): B1 Threshold, B2 Vantage
C (Proficient User): C1 Effective Proficiency, C2 Mastery
2 At present, depending on where the applicant comes from, the fee for TestDaF varies from
90 to 110 Euros.
TestDaF: TDN 3, TDN 4, TDN 5
In the case of the writing and speaking sub-tests, three versions of the
TDNs exist: a short, test-user-orientated version intended for the candidate and those interested in the candidate's level of language proficiency, and two much more detailed versions intended for the raters and the item writers (cf. Alderson's (1991) distinction between user-orientated, assessor-orientated and
constructor-orientated scales). A translation of the band descriptions for
Reading Comprehension reads as follows:
TDN 5
Can understand written texts from everyday university life (e.g. information
on study organisation) as well as texts on academic subjects not related to
specific disciplines (e.g. general environmental problems, socio-political
issues), which are complex in terms of both language and content, with regard
to overall meaning and specific details, and can also extract implicit
information.
TDN 4
Can understand written texts from everyday university life (e.g. information
on study organisation) as well as texts on academic subjects not related to
specific disciplines (e.g. general environmental problems, socio-political
issues), which are structurally similar to everyday usage, with regard to overall
meaning and specific details.
TDN 3
Can understand written texts from everyday university life (e.g. information
on study organisation) with regard to overall meaning and specific details;
however, cannot adequately understand texts on academic subjects not related
to specific disciplines (e.g. general environmental problems, socio-political
issues).
In TDN 4 the same discourse domains are referred to as in TDN 5, namely everyday university life and academic subjects not related to specific disciplines (for a discussion as to whether to include subject-matter knowledge as part of the construct to be measured, see, for example, Alderson 2000, Chapter 4; Davies 2001; Douglas 2000). However, the texts are no longer characterised as "complex in terms of both language and content", but as "structurally similar to everyday usage". In addition, no extraction of implicit information is required. Finally, in TDN 3, complexity of information processing is further reduced by restricting the discourse domain to everyday university life.
Reading Comprehension
The aim of the reading comprehension sub-test is to assess the candidate's
ability to understand written texts relevant in academic contexts. The sub-test
is intended to tap the following aspects of information processing: extraction
of selective information, reading for the gist of a message and for detailed
information, and complex information processing including comprehension of
implicit information.
Three different types of text and task are used to test these skills: in the first
task, a list of statements and several short and not very complex descriptive
texts are to be matched. In the second task, a longer and more complex text is
used, together with multiple-choice questions offering three options. The third
task consists of a fairly long and complex text and forced-choice items of the type "Yes / No / No relevant information in the text". Because the difficulty of a comprehension task is a function of the difficulty of both text and items, the complexity of both text and items has been taken into account in task design (cf. Alderson 2000; Grotjahn 2000, 2001). There are 30 items in all. The level
of difficulty of the three tasks is intended to match TDN 3, TDN 4 and TDN 5
respectively.
Listening Comprehension
The sub-test of Listening Comprehension is intended to assess the candidates
ability to understand oral texts relevant in academic contexts. Different levels
of processing are tapped on the basis of three different types of text and item:
selective information extraction, listening for gist and detailed information,
and complex information processing.
In the first part, a dialogue typical of everyday student life at university is
presented once: the candidates are instructed to read the questions given
beforehand and then to write short answers while listening to the text. In the
second part, an interview or a discussion about study-related matters is
presented once: the candidates are asked first to read the questions, which are
in the true-false format, and then to answer them while listening to the text. In
the third task, a monologue or a text containing relatively long monologue
passages is played twice and the candidates have to write short answers to
questions on the text. There are 25 questions in all. As in the case of Reading
Comprehension, the level of difficulty of the three tasks corresponds to
TDN 3, TDN 4 and TDN 5 respectively.
Written Expression
The aim of the sub-test of Written Expression is to assess the candidate's ability to write a coherent and well-structured text on a given topic taken from an academic context. In particular, the following macro-skills are tested, as both are key qualifications for academic study: (a) describing facts clearly and coherently; and (b) developing a complex argument in a well-structured fashion.
In the first part of the writing task, a chart, table or diagram is provided
along with a short introductory text, and the candidate is asked to describe the
pertinent information. Specific points to be dealt with are stated in the rubric.
In the second part, the candidate has to consider different positions on an
aspect of the topic and write a well-structured argument. The input consists of
short statements, questions or quotes. As in the case of the description, aspects
to be dealt with in the argumentation are stated in the rubric. Both parts have
to be related to each other to form a coherent text.
The candidate's written text is assessed by two licensed raters on the basis
of a detailed list of performance criteria. These include: grammatical and
lexical correctness, range of grammar and lexis, degree of structure and
coherence, and appropriateness of content. If the raters do not agree, the text
is assessed by a third rater.
Oral Expression
The sub-test Oral Expression assesses the candidate's ability to perform
various conventional speech acts that are relevant in an academic context. The
format of the sub-test is based on the Simulated Oral Proficiency Interview
(SOPI) of the Center for Applied Linguistics in Washington, DC. Test
delivery is controlled by a master audiotape and a test booklet, and the tasks
are presented to the candidates both orally from tape and in print. The test is preferably done in a language laboratory or, if that is not possible, with two cassette recorders. The candidates' responses are recorded on a second tape, allowing centralised rating (cf. Kenyon 2000).
The sub-test Oral Expression consists of four parts and comprises tasks of
varying levels of difficulty covering TDN 3 to TDN 5: in the first part, the
warm-up, the candidate is asked to make a simple request. The second part,
which consists of four tasks, focuses on speech acts relevant in everyday
student life, such as obtaining and supplying information, making an urgent
request and convincing someone of something. The third part, which consists
of two tasks, centres on the speech act of describing, while the fourth part,
which comprises three tasks, focuses on presenting arguments.
The candidate's oral performance is assessed by two licensed raters on the
basis of a detailed list of performance criteria. These include: fluency, clarity
of speech, prosody and intonation, grammatical and lexical correctness, range
of grammar and lexis, degree of structure and coherence, and appropriateness
of content and register. If the raters do not agree, the performance is assessed
by a third rater.
To reduce the number of tasks to be rated and thus make the rating process less time-consuming, the rating starts with tasks at the TDN 5 level. If the candidate's ability is considered to correspond to TDN 5, the rating process is terminated; otherwise, tasks at TDN 4 are assessed and it is decided whether the tasks at TDN 3 need to be rated as well (for a more detailed description of the sub-test Oral Expression see Kniffka and Üstünsöz-Beurer 2001).
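This top-down procedure, combined with the two-rater adjudication used for both productive sub-tests, can be pictured in a few lines of code. The following is a minimal sketch, assuming 0/1 level judgements from a hypothetical rate() helper; the operational procedure is of course richer than this:

    def adjudicated(first, second, third):
        """Two licensed raters judge each performance; a third rater is
        consulted only if the first two disagree."""
        return first if first == second else third

    def rate_oral_expression(performance, rate):
        """Top-down rating: start at TDN 5 and stop as soon as a level is
        confirmed, so fewer tasks need to be rated for strong candidates.
        `rate(performance, level)` stands in for a rater's judgement and
        returns True if the performance meets the named level."""
        for level in ("TDN 5", "TDN 4", "TDN 3"):
            if rate(performance, level):
                return level
        return "below TDN 3"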
Test methodology
Each test set has been trialled several times: first, in an informal pilot study
with approximately 20 German native speakers and approximately 40 learners
of German as a foreign language in Germany; and second, with learners of
German as a foreign language worldwide. In the latter case, the sample size
varied from approximately 120 in the case of the writing and speaking sub-
tests, to approximately 200 in the case of the reading and listening
comprehension sub-tests.
In all trials the participants were given a questionnaire and asked to provide
information on, for example, their language proficiency and periods of
residence in Germany. In addition, the subjects could comment extensively on
the test tasks. These data were intended to supplement the statistical item
analysis by providing qualitative information on specific aspects of the test.
The examiners were also given a questionnaire in which they could
comment on the administration of the test as well as on the test itself.
In the case of Reading and Listening Comprehension the pre-test results
were statistically analysed by means of classical item analyses and Rasch
analyses in co-operation with the University of Cambridge Local
Examinations Syndicate (UCLES). Statistical criteria taken into account
included: item difficulty, item discrimination, contribution of an item to sub-
test reliability, item and person misfit. These criteria were applied flexibly in
an iterative process.
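For the classical part of these analyses, the first two of the indices named above can be computed directly from the scored response matrix. A minimal sketch, assuming dichotomously (0/1) scored items; the actual analyses also involved Rasch calibration and item and person misfit statistics:

    import numpy as np

    def classical_item_analysis(responses):
        """Classical indices for a 0/1 scored response matrix.

        responses: 2-D array, rows = candidates, columns = items.
        Returns per-item difficulty (proportion correct) and discrimination
        (corrected item-total correlation: each item against the total
        score with that item excluded)."""
        responses = np.asarray(responses, dtype=float)
        difficulty = responses.mean(axis=0)
        totals = responses.sum(axis=1)
        discrimination = np.empty(responses.shape[1])
        for j in range(responses.shape[1]):
            rest = totals - responses[:, j]   # total score without item j
            discrimination[j] = np.corrcoef(responses[:, j], rest)[0, 1]
        return difficulty, discrimination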
(Figure: the iterative trialling cycle – data analysis and 1st revision; data analysis and 2nd revision; 2nd pretesting, data analysis and 3rd revision.)
The final versions of the three test sets turned out to have quite satisfactory
psychometric properties. The reliabilities of the Reading Comprehension and
Listening Comprehension sub-tests varied between approximately .70 and .85.
As a rule, for high-stakes decisions on individuals a reliability of .9 or more is
recommended. However, when the test results are reported in the form of a
profile, as is the case with TestDaF, a reliability of .9 or more for each sub-test
is normally not achievable, unless one considerably increases the number of
tasks and items in each sub-test. This was not possible, because TestDaF is
already quite lengthy. Nevertheless, this issue has to be considered in future
revisions of TestDaF.
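The length–reliability trade-off at issue here can be quantified with the standard Spearman–Brown prophecy formula (a textbook result, not part of the TestDaF documentation). Lengthening a sub-test by a factor $k$ changes its reliability $\rho$ to

    \rho_k = \frac{k\rho}{1 + (k-1)\rho}, \qquad k = \frac{\rho^{*}(1-\rho)}{\rho(1-\rho^{*})}

for a target reliability $\rho^{*}$. For a sub-test with $\rho = .75$, reaching $\rho^{*} = .90$ would require $k = (.90 \times .25)/(.75 \times .10) = 3$, i.e. roughly three times as many items, which illustrates why a reliability of .9 per sub-test was out of reach.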
Another concern was the inter-rater agreement in the sub-tests Oral
Expression and Written Expression, which was in quite a number of
instances not satisfactory, particularly in the case of Written Expression. It
is hoped that, with the help of the multi-faceted Rasch model, problematic aspects of the rating process can be better identified and the quality of the ratings thus improved.
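In a many-facet Rasch analysis of this kind (following Linacre's general formulation, which the paper does not spell out), the log-odds of candidate n being awarded category k rather than k-1 by rater j on criterion i are modelled as

    \log\frac{P_{nijk}}{P_{nij(k-1)}} = \theta_n - \beta_i - \gamma_j - \tau_k,

where $\theta_n$ is the candidate's ability, $\beta_i$ the difficulty of the criterion, $\gamma_j$ the severity of the rater and $\tau_k$ the step difficulty of category $k$. Separating $\gamma_j$ from $\theta_n$ is precisely what allows overly severe, lenient or inconsistent raters to be identified.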
Two other results of the statistical analyses also deserve mention. Both the
classical item analyses and the Rasch analyses indicate that the first reading
comprehension task, which requires a matching of statements and short texts,
appears to tap a dimension separate from that measured by the two remaining
tasks (for a justification see also Grotjahn 2000, p. 17f.; Taillefer 1996). As a
consequence, scaling all items in the Reading Comprehension sub-test on a
single dimension with the help of the one-parameter Rasch model appears to
be problematical.
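Under the one-parameter Rasch model referred to here, the probability that candidate n answers item i correctly is

    P(X_{ni}=1 \mid \theta_n, b_i) = \frac{\exp(\theta_n - b_i)}{1 + \exp(\theta_n - b_i)},

with a single ability parameter $\theta_n$ driving all items. It is exactly this unidimensionality assumption that the matching task appears to violate.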
The other result worth noting relates to the yes/no questions in the Listening
Comprehension test. Yes/no questions are quite common in listening tests.
They have, however, some well known drawbacks. In addition to the high
probability of guessing the correct answer, items with "true" as the correct answer often discriminate best among low-proficiency learners, items with "false" as the correct answer often differentiate best among high-proficiency learners, and neither of them discriminates well among low- and high-proficiency learners. Moreover, the correlation between items with "true" as the correct answer and those with "false" as the correct answer is often low or even negative (cf. Grosse and Wright 1985; de Jong, personal communication, 24 December 1999). Because some of TestDaF's true/false items behaved quite erratically, the true/false format should be reconsidered in future revisions.
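One simple way to make this check operational is to correlate the two keyed subscores. A minimal sketch, assuming a 0/1 scored response matrix; the function name and interface are illustrative:

    import numpy as np

    def true_false_key_check(responses, key):
        """Compare items keyed 'true' with items keyed 'false'.

        responses: 0/1 scored matrix (candidates x items).
        key: boolean array, True where the correct answer is 'true'.
        Returns the correlation between the two keyed subscores; a low or
        negative value reproduces the problem described in the text."""
        responses = np.asarray(responses, dtype=float)
        key = np.asarray(key, dtype=bool)
        true_part = responses[:, key].sum(axis=1)
        false_part = responses[:, ~key].sum(axis=1)
        return np.corrcoef(true_part, false_part)[0, 1]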
Information technology
One of the roles of information technology in the TestDaF Project was to
establish an appropriate infrastructure for test trialling and subsequent
worldwide official administration. The information technology is designed to support the current version of the test as well as the development of a
computer-based or even web-based format (for the latter see, for example, Röver 2001).
The information-technology infrastructure is being developed by the Department of Applied Computer Technology at the German Open University in Hagen (FernUniversität Gesamthochschule in Hagen), and consists of three main components: the item bank, the test-taker database and the test-administration database. It can handle any type of data, including multimedia
data (cf. Gutzat, Pauen and Voss 2000; Six 2000).
The item bank contains the items, tasks, sub-tests and tests as well as
formatting instructions for the automatic assembly of tasks in a specific print
format. An item or a task can be assigned certain attributes, which can then be
used, for example, to search for specific items or tasks, to compile tests
automatically or to select a task in computer-adaptive test administration.
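The attribute-based retrieval described here can be sketched as a simple record type plus a filter. Field names are assumptions for illustration, not the actual TestDaF schema:

    from dataclasses import dataclass, field
    from typing import Optional

    @dataclass
    class Item:
        """Illustrative item-bank record (hypothetical fields)."""
        item_id: str
        sub_test: str                       # e.g. "Reading Comprehension"
        task_type: str                      # e.g. "matching", "multiple choice"
        target_level: str                   # intended TDN level
        difficulty: Optional[float] = None  # Rasch difficulty, once calibrated
        tags: dict = field(default_factory=dict)  # free-form search attributes

    def find_items(bank, **criteria):
        """Return all items whose attributes match the given criteria, e.g.
        find_items(bank, sub_test="Reading Comprehension", target_level="TDN 4").
        A search like this could drive automatic test assembly or task
        selection in computer-adaptive administration."""
        return [item for item in bank
                if all(getattr(item, key, None) == value
                       for key, value in criteria.items())]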
The test-taker database contains personal information on the participants
such as name, address, test date and level obtained in each of the four sub-
tests. It can easily be searched and various statistical analyses can be carried
out.
In the test-administration database, date and place of the examination are
stored together with the candidates scores and bands, the questionnaire
answers and the results of item analyses such as difficulty and discrimination
values.
For all three components a uniform, easy-to-use, web-based user interface
is being developed on the basis of the most recent software technology to
afford maximum flexibility as well as independence from commercial test
creation and delivery software.
For the equating to the TDNs, grammar items from the German item bank
of the computer-adaptive Linguaskill test were provided by UCLES as
anchors.3 The anchors had themselves been scaled and equated to the ALTE
levels on the basis of scores from various examinations.
It is obvious that the characterisation of candidates' reading and listening proficiency by means of a TDN is based on a highly complex chain of inferences: on the basis of the candidates' responses to the test items, a numerically based inference is made with regard to the non-observable construct "comprehension ability in academic contexts". Next, this ability estimate is related to a system of scaled can-do statements on the basis of quite limited data.
Empirical studies are needed to demonstrate whether this highly inferential
process as well as the descriptors used in the TDNs are sufficiently valid.4
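The final link in this chain of inferences amounts to a cut-score lookup on the calibrated ability scale. A minimal sketch; the cut values below are invented for illustration, the operational ones being the product of the anchor-based equating described above:

    def tdn_band(theta, cuts=((1.5, "TDN 5"), (0.5, "TDN 4"), (-0.5, "TDN 3"))):
        """Map a calibrated ability estimate to a TDN band.
        The cut-scores used here are purely illustrative."""
        for cut, band in cuts:
            if theta >= cut:
                return band
        return "below TDN 3"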
When TestDaF and its theoretical basis were first presented to the public, a
sometimes polemical discussion ensued. Political aspects aside, the following
features of TestDaF were criticised in particular:
1. The testing of reading and listening as isolated skills rather than in
combination with writing
2. The use of multiple-choice items rather than open-ended tasks
3. The testing of speaking by means of predetermined stimuli presented in
print and on audiotape rather than by means of a face-to-face oral interview.
Before addressing each issue, I shall deal briefly with the question of
authenticity in language testing, which is involved in all three issues.
Authenticity
It is often argued that language tests should be authentic; that is, that they should mirror as closely as possible the content and skills to be assessed. In
my view, authenticity should not be overemphasised in the context of a high-
stakes test such as TestDaF (for a similar view see Alderson 2000; Lewkowicz
2000). Authenticity might be important with regard to the face validity of a
test or the potential impact on foreign language classes. However, a highly
authentic test is not necessarily a highly valid test. If candidates pay quite a large sum of money for a test, and if success or failure in the test entails important consequences, they will do their best even if they consider the test to be inauthentic.
3 In a recent study, Arras, Eckes and Grotjahn (2002) investigated whether the C-Test could be used for the calibration of TestDaF's reading and listening comprehension items. The C-Test developed proved highly reliable (Cronbach's alpha = .84) and could be successfully scaled on the basis of Müller's (1999) Continuous Rating Scale Model. Furthermore, the C-Test correlated substantially with the reading, listening, writing and speaking parts of TestDaF (Spearman's r > .64).
4 Relating the content of the items clustering at a specific ability level to the TDNs proved not to be very informative (cf. McNamara 1996, pp. 200ff., for this kind of content referencing).
communication are certainly missing and can thus not be adequately tested.
However, as a consequence of the standardisation of the input and the
centralised marking of the tapes, objectivity and reliability and thus possibly
also criterion-related validity are higher than in a traditional oral proficiency
interview. Furthermore, the SOPI is much more economical than the ACTFL
OPI, at least if it can be administered as a group test in a language laboratory.
Finally, as the recently developed computer-based version of the SOPI, the Computerised Oral Proficiency Instrument (COPI), demonstrates, with some slight modifications the SOPI format lends itself even to some form of computer-adaptive testing (cf. Kenyon and Malabonga 2001; Kenyon, Malabonga and Carpenter 2001; Norris 2001).
Perspectives
In the case of a high-stakes admission test such as TestDaF, long-term, quantitatively and qualitatively orientated empirical research is a must (cf. Grotjahn and Kleppin 2001: 429). A key issue to be investigated as soon as possible is whether a candidate's TestDaF band profile is a sufficiently valid indicator of their communicative proficiency in a real-life academic context. Furthermore, research is needed, for example, into what computer-based and web-based versions of TestDaF should look like. In this context, the following issues in particular need to be addressed in the near future:
1. Should the rather lengthy reading texts be replaced by texts that fit on a
single screen? Or should the tasks in the reading and listening
comprehension sub-tests even be replaced by a series of much shorter tasks,
each consisting of a short text and a few items? Should these tasks then be
treated as stochastically independent testlets in the statistical analyses and
be analysed by means of the partial credit model or Müller's (1999)
Continuous Rating Scale Model? The testlet approach would have the
advantage that the reading and listening comprehension sub-tests could be
made more reliable and that even a computer-adaptive test algorithm could
be implemented. A possible drawback of such an approach is that the
ability to comprehend complex and lengthy texts, which is important in an
academic context, can probably not be adequately assessed in this way.
2. One should examine whether the marking of the sub-test Oral Expression
could be made less time-consuming. For example, one could investigate
whether, in the case of a candidate identified as low-proficient in the
reading and listening comprehension sub-tests, the rating should proceed in
a bottom-up manner instead of the present top-down approach. With
regard to a computer-based version, one should also investigate whether
candidates identified as low-proficient beforehand should be given less
difficult tasks than highly proficient candidates. Research into the COPI
shows that oral proficiency testing can thus be made more economical and
also less threatening for some candidates.
References
Alderson, J. C. 1991. Bands and scores. In J.C. Alderson and B. North (eds.),
Language Testing in the 1990s: The Communicative Legacy (pp. 71–86).
London: Macmillan.
Alderson, J. C. 2000. Assessing Reading. Cambridge: Cambridge University
Press.
Arras, U., T. Eckes and R. Grotjahn. 2002. C-Tests im Rahmen des Test
Deutsch als Fremdsprache (TestDaF): Erste Forschungsergebnisse. In R.
Grotjahn (ed.), Der C-Test: Theoretische Grundlagen und praktische
Anwendungen (Vol. 4, pp. 175–209). Bochum: AKS-Verlag.
Association of Language Testers in Europe (ALTE). 1998. ALTE Handbook of
European Language Examinations and Examination Systems. Cambridge:
University of Cambridge Local Examinations Syndicate.
Bennett, R. E. and W. C. Ward (eds.) 1993. Construction vs. Choice in
Cognitive Measurement: Issues in Constructed Response, Performance
Testing, and Portfolio Assessment. Hillsdale, NJ: Erlbaum.
Chalhoub-Deville, M. (ed.) 1999. Issues in Computer Adaptive Testing of
Reading Proficiency. Cambridge: Cambridge University Press.
Cohen, J. 1968. Weighted kappa: Nominal scale agreement with provision for
scaled disagreement or partial credit. Psychological Bulletin 70: 213–220.
Council of Europe. 2001. Common European Framework of Reference for
Languages: Learning, Teaching, Assessment. Cambridge: Cambridge
University Press.
Davies, A. 2001. The logic of testing Languages for Specific Purposes. Language Testing 18 (2): 133–147.
de Jong, J. H. A. L. 1999. Personal communication, 24 December.
Douglas, D. 2000. Assessing Languages for Specific Purposes. Cambridge:
Cambridge University Press.
Embretson, S. E. and S. Reise. 2000. Item Response Theory for Psychologists.
Hillsdale, NJ: Erlbaum.
Grosse, M. E. and B. D. Wright. 1985. Validity and reliability of true-false
tests. Educational and Psychological Measurement 45: 1–14.
Grotjahn, R. 2000. Determinanten der Schwierigkeit von Leseverstehensaufgaben: Theoretische Grundlagen und Konsequenzen für die Entwicklung von TESTDAF. In S. Bolton (ed.), TESTDAF: Grundlagen für die Entwicklung eines neuen Sprachtests. Beiträge aus einem Expertenseminar (pp. 7–55). Köln: VUB Gilde.
Grotjahn, R. 2001. Determinants of the difficulty of foreign language reading
and listening comprehension tasks: Predicting task difficulty in language
tests. In H. Pürschel and U. Raatz (eds.), Tests and Translation: Papers in Memory of Christine Klein-Braley (pp. 79–101). Bochum: AKS-Verlag.
12
A Progetto Lingue 2000 Impact Study, with special reference to language testing and certification
Roger Hawkey
Educational Consultant, UK
Introduction
An impact study of a major national language development innovation such as
the Progetto Lingue 2000 (PL2000) in Italy, commissioned by the University
of Cambridge Local Examinations Syndicate (UCLES), is an appropriate topic
for inclusion in a collection of papers on language-testing issues in the
European Year of Languages. This paper, which summarises and updates a
presentation given at the ALTE Conference in Barcelona on 6 July 2001,
outlines the first stages and presents some tentative early findings of the
Cambridge PL2000 Impact Study. The presentation included video recordings
of activities from Progetto Lingue classrooms, and of interviews with students,
parents, teachers, school heads and Ministero della Pubblica Istruzione (MPI)
officials. In addition, the PL2000 co-ordinator, Dr Raffaele Sanzo of the MPI
(now renamed the Ministero dell'Istruzione, dell'Università e della Ricerca (MIUR)) described the principles and practices of the Progetto itself at the
conference in a plenary presentation.
In language teaching and testing, the concept of impact has been a matter
of both theoretical and practical consideration, often in distinction from
washback. Hamp-Lyons (1997) sees washback as referring to the ways in
which tests affect teaching and learning, and impact as covering their broader
influence on education and society. Judged against these definitions, the
Cambridge PL2000 Impact Study qualifies as an impact study, taking as it
does the kind of multi-dimensional approach proposed by Bailey (1996), and
Alderson and Wall (1993). The study considers the impact of the PL2000 on
parents, educational managers, language-teaching materials producers,
language testers and employers, and students and teachers. It also attempts to
cover teaching/learning processes as well as content, what Milanovic and Saville (1996) refer to as "the complex interactions between the factors which make up the teaching/learning context (including the individual learner, the teacher, the classroom environment, the choice and use of materials, etc.)" (p. 2).
(Figure: the stakeholders considered by the Impact Study – students, parents, teachers, teacher-trainers, testers, publishers, receiving institutions and employers – grouped around the PL2000 learning goals, curriculum and syllabus, methodology and teacher support.)
– PL2000 and other candidacies per school cycle (media; licei classici, scientifici, magistrali; istituti tecnici, professionali, d'arte)
– PL2000 and other candidacies per region
– students attending PL2000 exam preparation courses
– PL2000 and other external exam candidate ages, years of English study, hours per week of English, other foreign languages studied, English language exposure.
Relevant to the Impact Study, too, are the kind of written English language
performance analyses for PL2000, Italy non-PL and global comparator
students, which can be obtained from the Cambridge Learner Corpus (CLC)
(see Appendix B). These have already been used in the pursuit of the teacher
support aim of the Cambridge Impact Study (see above). Seminars using CLC
data examine, with PL2000 teachers, selected writing exam scripts (suitably
anonymised) of PL2000 candidates, with a view to improving future
performance.
PLIS is also collecting and analysing longitudinal case-study data,
including:
– video-recorded classroom observation of PL2000 teaching/learning approaches and activities, materials, media, classroom management and assessment
– semi-structured, video-recorded individual and group interviews with school heads, teachers, students, parents, alumni, PL2000 officials and employers
– the administration of the ALTE, Oxford University Press, UCLES Quick Placement Test (QPT) and the UCLES Language Learning Questionnaire (LLQ) to students in the case-study classes
– completions by students in the case-study classes of the PLIS student questionnaire on language background, foreign-language experience, use, attitudes and plans, testing experience, life goals
– case-study student work and internal test samples
– correspondence with PLIS teachers and students
– entries from teacher-contestants in the 2001–2002 UCLES PL2000 essay prize (title: "What the Progetto Lingue 2000 means to me as a teacher").
By the end of April 2002, the longitudinal case-study visit data from the seven selected schools (an elementary school, a comprehensive school and a technical institute in the north; a middle school and a liceo in central Italy; and a middle school and a science liceo in the south) included the following:
– 20 videoed PL2000 language classes
– 20 completed PLIS teacher questionnaires (see teacher questionnaire format at Appendix B)
– 110 completions each of the October 2001 and April 2002 PLIS student questionnaires (see student questionnaire format, along with school and
More than 100 students are thus involved in the case-study classes; in addition to their videoed classroom action, they will be covered by their own responses to:
– the PLIS student questionnaire
– the Quick Placement Test (QPT), to identify student proficiency levels corresponding to the Council of Europe Framework for Foreign Languages
– the Cambridge Language Learning Questionnaire (LLQ) on socio-psychological factors (attitudes, anxiety, motivation, effort) and strategic factors (cognitive and metacognitive)
and by
– samples of their EL work
– internal language test performance data
– teacher comments.
of the target external exam (e.g. the PET course). Such cases of impact from
test to language course warrant in-depth analyses from our classroom,
interview and questionnaire data.
PL2000 stands or falls, of course, on the effectiveness of the teaching and
learning on PL programmes: how communicative and how good are the
lessons? The Progetto certainly seems to have got its communicative message across, although the communicative approach sometimes seems to be equated with oral skills training; this is perhaps a reaction against the reading- and writing-biased language teaching of the past, and to the PL2000's implicit encouragement of appropriate partial competencies. Our analyses of the
classes videoed are checking for coverage of communicative domains,
purposes, settings, interaction, modes, media, levels, skills and functions.
Video data so far suggest the following important areas of analysis in this
regard:
– lesson planning and management
– task design and implementation
– variety of activity and interaction modes
– teacher : student relationships and activity balance
– information technology use and integration
– educational value added
– teaching/learning : test relationships.
There are mixed responses so far on the availability and use of the PL2000
resource centres; also on teacher support programmes. The financial support
from MPI/MIUR for PL2000 courses is much appreciated, although there are
sometimes problems of timing. The PL2000 agreement that students should
not pay for PL2000-related exams, for example, may affect enrolments in
external exams if fee financing is late. This has important implications for the
end of the Project, too. Will fewer students enter for external exams when they
or their schools have to pay?
There are clearly many more insights to come from the Cambridge PL2000
Impact Study. Even from the examples of early indications above, it may be
seen that the areas of impact are broad and varied. Once the April 2002 follow-
up data have been collected, there should also be evidence of change over a
year or more of Progetto experience among our case-study students, parents,
teachers, school heads and others.
References
Alderson, J. and D. Wall. 1993. Does washback exist? Applied Linguistics 14:
115–129.
Bailey, K. 1996. Working for washback: a review of the washback concept in
language testing. Language Testing 13: 257–279.
Hamp-Lyons, L. 1997. Washback, impact and validity: Ethical concerns.
Language Testing 14: 295–303.
Kirkpatrick, D. L. (ed.) 1998. Another look at Evaluating Training Programs.
American Society for Training and Development (ASTD).
McKay, V. and C. Treffgarne. 1998. Evaluating Impact, Proceedings of the
Forum on Impact Studies (24–25 September 1998), Department For
International Development Educational Papers, Serial No. 35.
Milanovic, M. and N. Saville. 1996. Considering the impact of Cambridge
EFL exams. Research Notes 2, August 2000.
Saville, N. and R. Hawkey. (forthcoming). Investigating the washback of the
International English Language Testing System on classroom materials. In
Liying Cheng and Yoshinori J. Watanabe (eds). Context and Method in
Washback Research: The Influence of Language Testing on Teaching and
Learning.
Varghese, N. V. 1998. Evaluation vs. impact studies. In V. McKay and C. Treffgarne 1998. Evaluating Impact, Proceedings of the Forum on Impact Studies (24–25 September 1998), Department For International Development Educational Papers, Serial No. 35.
Appendix A
Appendix B
Your qualifications
Types of school
Show how well the following objectives of the PL2000 have been achieved in your school:
Very well / Well / Not very well / Hardly at all
Comments
Please list here and comment on what you see as the advantages and disadvantages of the
PL2000 for your students and you.
Please write here how you think your English language teaching and attitudes have changed over
the past year.
PLIS TQ
Thank you very much!
Appendix C
PLIS Case-study Student Questionnaire
About you and your language learning
Full name
School Class Age Male/Female
Language(s) you study at school?
How many years have you studied English?
Any extra English course(s) this school year? (Yes/No)
If yes, what course(s)? Where, when?
Please put numbers in the boxes to show how often you do the following activities in
English outside your school.
0 = never; 1 = almost never; 2 = occasionally; 3 = often
Please write here how you think your English language learning and attitudes have changed
over the past year.
(Sample case-study class record: FL – E, F, G, Sp; Tests/Classes – 1 elementary, 2 pre-intermediate, 3 intermediate, 4 upper-intermediate, 5 mostly Brit Lit; textbooks – Headway, Enterprise, FC Gold, Now and Then (Zanichelli), Views of Literature (Loescher); Resource Centre – yes, 2nd floor: books, videos, cassettes, magazines, becoming TRC; Use of English – TV, Internet, overseas travel, study holidays, further studies, travel; PL impact – co-operation, objectivity in evaluation competence, student motivation; Class 3F – ages 18 (5), 17 (1), 16 (10), 15 (2); M/F 5/13.)
13
Distance-learning Spanish courses: a follow-up and assessment system
Introduction
The easy-access Assessment System of the Instituto Cervantes distance-
learning Spanish courses on the Internet allows students to consult a large,
well-organised storage area housing all the information on their performance.
The system was designed for semi-autonomous use and to allow students to
develop their own learning strategies, working jointly with study group
members and tutors. It falls to students themselves to use the system as often
as necessary to gain a clear picture of their progress.
The Follow-up and Assessment System is housed in the Study Room, the
place reserved for students to work in with their group companions and tutor.
Here students have all the teaching aids and materials at their fingertips. This
space is both the students' reference and starting point.
To receive assessment information, students have only to click on the
Assessment System icon to consult a number of reports organised into three
systems or sections: Automatic Follow-up, Automatic Progress Test and
Course Tutor's Assessment.
Automatic Follow-up
This system, activated by students themselves, accesses a wide-ranging
document. Whenever students require an overall view of their performance
and work pace or wish to revise certain points, the Automatic Follow-up
System provides the requested information in a stratified and detailed manner.
At the behest of students, this system stores the results of the study
exercises done by students on their own, without help from tutors or
companions.
Theoretical Framework
In the wake of the Internet, a new medium which has generated new needs, the
Instituto Cervantes has created its distance-learning Spanish courses on the
Internet (Gerardo Arrarte et al. 2000). Our objective is to teach Spanish as a
Second Language to an adult public by using this new technology.
We believe that our target students, as students of a second language,
expect to be able to make satisfactory communicative exchanges in the target
language, and, as students who have opted for a distance-learning method on
the Internet, they expect to be able to acquire this new knowledge within a
flexible timetable.
These two approaches, at first glance very different, interlock perfectly in
the Instituto Cervantes distance-learning Spanish courses.
For two reasons, our courses follow the task-based learning principles of
the Communicative Approach. Firstly, the Internet permits the use of real
materials and encourages students to seek out new examples of the target
language. Secondly, contacting other students who share a similar interest in
anyone who wants to learn a foreign language must be aware that the
learning process depends to a large extent on their own sense of
responsibility and degree of participation.
Despite this requirement to work together, students still benefit from the
advantages of the one-to-one attention and assessment of a tutor, responsible
for assessing communicative exercises. They also benefit from the tools
offered by the computer system for the more quantifiable aspects.
The principal aim of our present analysis is to showcase the assessment
tools we have incorporated into the computer system to speed students'
acquisition of knowledge and different communication skills, and to show
how these tools aid tutors in this new learning area.
Finally, returning to the elements we have tried to keep in mind in the
design of the distance-learning Spanish courses on the Internet, we stress
again the fact that the students have other group companions around them to
offer support and a tutor leading the way.
Joint interaction is achieved through the use of the communication tools
provided by the Internet. This communicative exchange can take place orally,
in written form, through the use of communication tools such as email, chat,
forums, audio and video-conferencing ... and whatever this fast-moving
technology comes up with next. With this in mind, we have tried to create an
environment capable of incorporating the changes of this new medium, which is
often difficult to keep up with.
Our distance-learning Spanish courses arise from a mutual adaptation
process involving learning tasks, the aforementioned tools, the group of
students and their tutor. This said, we must now consider the real significance
of this communicative exchange in and of itself.
The technical aspects at work, making this exchange possible, allow us to
embrace the notion of electronic literacy (C. A. Chapelle 2001: 88):
As language learners are increasingly preparing for a life of interaction with
computers and with other people through computers (D.E. Murray, 1995),
their electronic literacy (Warschauer, 1999) becomes an additional target.
An argument about authenticity needs to address the question of the extent
to which the CALL task affords the opportunity to use the target language
in ways that learners will be called upon to do as language users, which
today includes a variety of electronic communication.
As for the tutor, his or her role corresponds to the profile of guide and
counsellor, as described by Holec, cited above (1979: 25). The tutor helps to
draw out the linguistic and communicative elements which constitute the aim
of study and encourages a satisfactory work pace. The level of commitment is
paced by students themselves but it must also, in turn, keep up with the pace
of the group. If they wish, then, Spanish course students can overcome the
isolation which is often inherent in distance-learning.
Added to this is the fact that, in this kind of environment, the student has to
Assessment System
Data storage
Students deal with our courses in an independent, autonomous manner and
require tools informing them of how much has been completed and how well.
At the outset of the course, students may remember what they have already
done, but after completing a number of exercises over a number of hours, or
if work has been held up for some time, students will need reminders. The
Assessment System was created to meet this need.
The minimum storage unit built into the design of the course is the exercise
(although an exercise can be made up of more than one computer screen page).
At the end of each exercise, an icon comes up to allow storage of data and
results.
It allows the system to pick up the student's participation in the exercise,
evaluate and store the correct answers (further on we will find out what this is
based on) and indicate the time spent.
If the students are not satisfied with their performance they can repeat the
exercise. The computer also registers the number of attempts.
Therefore, it is the students who control the option of storing their work on
a particular exercise, along with the results obtained. This solution was chosen
over automatic storage of student participation to make students themselves
responsible for their own learning process. Proper progress depends on an
awareness of this fact.
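To make this storage logic concrete, the sketch below models it as a small record per exercise. It is purely illustrative: the class and field names are our own assumptions, not those of the actual course software.

```python
from dataclasses import dataclass, field

@dataclass
class ExerciseRecord:
    """One stored unit of work: the exercise is the minimum storage unit."""
    exercise_id: str
    percent_correct: float   # share of correct answers, evaluated on storage
    seconds_spent: int       # time spent on the exercise, also registered
    attempts: int = 1        # the computer registers repeated attempts

@dataclass
class StudentLog:
    """Per-student store, written only when the student clicks the icon."""
    records: dict = field(default_factory=dict)

    def store(self, rec: ExerciseRecord) -> None:
        # Repeating an exercise overwrites the result but counts the attempt;
        # nothing is recorded unless the student chooses to store it.
        prev = self.records.get(rec.exercise_id)
        if prev is not None:
            rec.attempts = prev.attempts + 1
        self.records[rec.exercise_id] = rec
```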
By clicking on Assessment, an information table, which can be opened if
required, appears on the screen. The information it contains is twofold, as can
be seen below:
> The material covered, detailing what has not yet been accessed
(waiting), what has been accessed but not completed (under way) and
what material has been fully dealt with (completed).
> An automatic assessment of how the different exercises comprising
each piece of material have been done. This information is made available
in a simple colour code.
Figure 1: The assessment table for Curso A1, showing for each lesson (Tema 1, Tema 2, Tema 3) its status (Estado): Pendiente (waiting), En curso (under way) or Terminado (completed); and its automatic assessment (Valoración automática): Sin valoración (no assessment), Mejorable (improvable), Adecuado (satisfactory) or Excelente (excellent).
The information that the student receives in this figure deals with, on the
one hand, the status of the material that has been worked on, indicating lesson
by lesson whether the entire study route has been completed.
On the other hand, it presents the automatic assessment of what has been
completed up to that point, consisting of the percentage of correct answers
obtained in the stored exercises.
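A minimal sketch of how such a report could be derived, using the status labels of Figure 1 and treating the automatic assessment as a plain average over the stored exercises (the real system's internals are not published, so both functions are assumptions):

```python
def lesson_status(total_exercises, accessed, completed):
    """Three-way status shown lesson by lesson (labels as in Figure 1)."""
    if accessed == 0:
        return "Pendiente (waiting)"
    if completed < total_exercises:
        return "En curso (under way)"
    return "Terminado (completed)"

def automatic_assessment(stored_percent_correct):
    """Average percentage of correct answers over the stored exercises."""
    if not stored_percent_correct:
        return None  # rendered as 'Sin valoración' (no assessment yet)
    return sum(stored_percent_correct) / len(stored_percent_correct)
```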
Added to this is the opportunity to receive more detailed information on the
lesson, or to drill down to the next information level, corresponding to the
stages of each lesson.
We can take a closer look.
Figure 2
By clicking on the Details icon and then the Report icon, the student can:
> Run down the table of contents showing the course structure to find the
level he or she is interested in.
> Access the Assessment Report corresponding to the content of the selected
level.
The Report opens a table showing the student's performance from different
perspectives.
The Final Evaluation reflects the percentage of correct answers obtained in
the exercises stored by the student in each stage.
The Interactive Graphic Story shows the number of attempts and the
evaluation of this material, which revises the lesson contents in a game format.
The Time Spent allows the student to refer to the time spent on the courses
and compare it with the marks obtained.
The Final Evaluation offers a compendium of the automatic follow-up
marks stored by the computer for the student in each lesson.
Student commitment
When learning a language, communication is fundamental, as is, therefore,
contact with other students or speakers.
We believe this to be just as important in distance-learning. It is
fundamental to set up a virtual learning community (a group) and to encourage
the formation of motivating relationships among its members to help
them achieve their study aims (we also learn from other people outside this
group).
As J. Fernández notes (1999: 78):
the imparting of knowledge also occurs from person to person, and so
another of the great benefits of the Internet is this: interactivity. The world
of education is ever more dynamic because now the students can spread
knowledge in ways never before imagined.
In the words of C. A. Chapelle (2001: 32):
the experience crucial for individual cognitive development takes place
through interaction with others, and therefore key evidence for the quality
of a learning activity should be found in the discourse that occurs in the
collaborative environment.
For this reason, we have created a tool that provides information on the
student's work pace and that of his or her group companions. This can be
accessed from the Automatic Follow-up option of the Assessment System.
When the student clicks on this tool, the computer provides illustrated
information in two coloured lines. One line represents the work pace of the
student requesting the report, whereas a second line in a different colour
represents the work pace of the study group to which he or she belongs.
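Read this way, the tool plots two cumulative curves. The sketch below shows one way such curves could be computed; the function names, the day-by-day granularity and the use of a simple group mean are all our assumptions:

```python
def cumulative_pace(stored_per_day):
    """Cumulative count of stored exercises, day by day (one coloured line)."""
    total, curve = 0, []
    for n in stored_per_day:
        total += n
        curve.append(total)
    return curve

def group_pace(group):
    """Mean cumulative pace of the study group (the second coloured line).
    Assumes every member's daily counts cover the same course calendar."""
    curves = [cumulative_pace(member) for member in group]
    return [sum(day) / len(day) for day in zip(*curves)]
```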
In making this comparison possible, we have two aims in mind. On the one
hand, the students are encouraged to establish a comfortable work pace, in the
knowledge that they are the only people who can do this, in keeping with their
own particular circumstances. In any case, this is also unavoidable in the kind
of language course we have designed, because the communication exercises,
carried out either simultaneously (via chat, for example) or with a time delay
(via email), require answers that emerge from the didactic sequence if progress
is to be made in the acquisition and mastery of linguistic and other contents.
On the other hand, the tutor, as guide, can immediately spot any students
who fall behind the group pace, and offer them support. As described in the
previous paragraph, the tutor may even find himself with another function,
namely that of work partner in the programmed communication exercises (if
a student falls behind the rest of the group and has no one with whom to do
the exercise).
This tool is geared only towards informing the student and backing up the
idea of group learning. At all times the student has a motivating point of
reference in his companions, thus avoiding any sense of isolation.
Courses A1, A2, A3 and A4
Each lesson is built around ten work sessions lasting roughly an hour each.
The design of these work sessions includes prior presentation of new linguistic
and socio-cultural content and practice exercises graded from the more
controlled to the more open. At the end of each session, the student has to
contact the rest of the group to employ what he or she has learned in real
communicative interactions or exchanges.
Any student dealing alone with the acquisition of knowledge has to be sure
he or she is working correctly and that acquisition is occurring effectively.
To meet this requirement, Passport Controls or progress tests have been
built in every three work sessions. Therefore, there are three Passport Controls
in each lesson.
Passport Control
The Passport Controls are designed as a sequence of interactive exercises of a
variety of types (dragging, choosing, writing, etc.), in which students have to
show the knowledge and skills they have acquired (reading comprehension,
listening comprehension and writing skills). The more open oral
communication and written expression skills are assessed by the tutor.
This test differs from the other work-session interactive exercises, which are
automatically corrected by the computer, in that it provides a single percentage
mark once all the exercises have been completed. Right and wrong answers are not
indicated. The underlying intention is to allow the students, in charge of their
own learning process, to go back and revise the exercises in which they have
obtained the poorest results (they can see this from the detailed report in
Automatic Follow-up). This done, they can return to Passport Control and try
to increase their percentage of correct answers.
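The distinguishing behaviour of a Passport Control (a single overall percentage, released only when the whole sequence is completed, with no per-item feedback) could be modelled as follows; this is a hypothetical sketch, not the actual implementation:

```python
def passport_control_mark(item_correct, total_items):
    """Overall percentage, released only once the whole sequence is done.
    Individual right and wrong answers are deliberately not reported."""
    if len(item_correct) < total_items:
        return None  # no mark until every exercise has been attempted
    return 100.0 * sum(item_correct) / total_items
```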
In this approach we have taken our lead from authors such as A. Giovannini
et al. (1996: 29):
autonomy depends on our decision-making capacities and our ability to
accept responsibility, to self-evaluate and supervise our own learning
process...
In short, it is up to the students themselves to check whether they have
understood what they have studied.
As with the storage of exercise work, the students can decide to do the
Passport Control exercises without storing their answers if they are not ready
to have their attempts registered by the system.
Figure 5
characters provided. Following this, they define a personality for each of them
by giving them features. Finally, they have to contact another student and the
working pair then exchange information on their respective families.
As before, sending the results of the exchange to the tutor to receive his or
her advice and suggestions for improvement completes the task.
Figure 6
Rating scales
Each of these templates evaluates a particular skill. This evaluation comprises
a variable number of assessable areas (register, style, vocabulary, grammar,
etc.). We have adapted the proposals presented in Alderson, Wall and
Clapham (1998) and the instructions for examiners of the Basic Diploma in
Spanish as a Foreign Language (2001) for our own use.
The computer, in accordance with the areas evaluated by the tutor,
automatically calculates the total mark in percentages. The student also
receives suggestions for improvement.
In each assessable area there are three qualifications (see the sketch below):
1. Excellent (>85%)
2. Satisfactory (>60%, <85%)
3. Improvable (<60%)
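In code, this banding, together with the automatically calculated total, might look as follows. The cut-offs come from the list above; averaging the areas with equal weight is our assumption, since the text does not specify how the total is derived:

```python
def qualification(percent):
    """Band an area score using the published cut-offs (behaviour at
    exactly 60% or 85% is not specified in the text)."""
    if percent > 85:
        return "Excellent"
    if percent > 60:
        return "Satisfactory"
    return "Improvable"

def total_mark(area_scores):
    """Total mark in percentages across the areas marked by the tutor.
    Example: total_mark({"register": 70, "style": 65, "vocabulary": 80})
    Equal weighting of areas is an assumption, not a documented rule."""
    return sum(area_scores.values()) / len(area_scores)
```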
VOCABULARY:
1. Satisfactory vocabulary, although occasionally imprecise.
2. Vocabulary frequently unsatisfactory. Vocabulary errors.
3. Vocabulary almost entirely imprecise with many errors.
Improvement suggestions:________________________________________________
Finally, we should keep in mind that on our courses students always have
access to their course tutors through the email, forum or noticeboard facilities.
They are free to consult them at any time, in the knowledge that help will
always be forthcoming.
Conclusions
The Assessment System was designed to complement the courses' distance-
learning method and communicative approach. Basically, it provides students
with progress input from two main sources:
1. From an automatic follow-up and self-correction system.
2. From a system of assessment by course tutors.
These two information systems were designed to help students make
conscious, informed choices as to the appropriateness of certain strategies
when learning a foreign language.
The first system provides information on students' achievements in
different areas of language learning. It stores information from presentation,
conceptualisation and practice exercises covering different linguistic and
socio-cultural areas of Spanish. These exercises involve discrete-point items
that can be computer-assessed. However, exercises requiring the production of
a creative text, such as a composition, or communicative exchange with a
partner, come up against the limitations of a computer system incapable of
assessing them.
Therefore, the second system was designed to help overcome these
limitations by the provision of tutors equipped to inform students of their
performance in meaningful tasks. Here they have to activate and integrate
knowledge to achieve different kinds of communicative goals, such as
creating and receiving texts in chats, emails, etc., using all the skills necessary
for communication.
The computer's capacity to store information is one of the main advantages
of distance-learning, as it provides students with continuous follow-up
information which they can consult at any time. By requiring students to store
data and visit the automatic follow-up and self-correction system, we have
tried to make them aware that their learning process is to a large extent their
own responsibility.
The role of tutors is to offer specific improvement advice and help students
focus not only on linguistic and socio-cultural elements, but on learning
strategies as well. This makes the assessment system consistent with the
methodological principles of the Cursos, and its success depends on whether
the material stored from assessable exercises (electronic mail, chat
transcriptions, etc.) provides tutors with sufficient information to carry out a
proper needs analysis for each student.
The creators of the Assessment System have also wished to establish a
virtual study community by incorporating the work pace space. Here,
individual students can compare their work pace with that of the group. In this
way, a student who is falling behind can work with tutors and group
companions to seek out the reasons and find ways of avoiding isolation. This
is important if group exercises are to work.
In short, the success of the Assessment System depends largely on whether
or not the weighting of tasks between the automatic system and the tutorial
system, along with group work, gives students enough information to be able
to identify their problems and take charge of their own studies.
In the future, we should focus on two pedagogical issues that still provide
a challenge for creators of distance-learning systems. On the one hand, we
must find out how to equip self-evaluation systems with the means of
identifying and classifying any difficulties students might have in their
learning strategies and style, through specific, automatic messages designed to
help them think about and self-evaluate their work. On the other hand, to solve
the problem of assessing oral performance, we need communication tools,
such as chat systems and audio-conferencing, to be more reliable, more easily
accessible and faster.
The Assessment System, which is constantly revised and updated by
Instituto Cervantes staff, is implemented by a team of technicians at the
Institute of International Economics at the University of Alicante. Revision is
twofold: on the one hand, we check that the two elements of the Assessment
System (student-based and tutor-based) work in an integrated and consistent
manner. On the other hand, we use feedback from real groups of students
studying the Cursos.
References
Arrarte, G., A. Duque, G. Hita, O. Juan, J. M. Luzón, I. Soria and J. I. Sánchez. Instituto Cervantes. Cursos de español a distancia a través de Internet. Una experiencia de innovación pedagógica del Instituto Cervantes. Congreso Internacional de Informática Educativa 2000, Universidad Nacional de Educación a Distancia.
Bates, A. W. 1995. Technology, Open Learning and Distance Education. London: Routledge.
Chapelle, C. A. 2001. Computer Applications in Second Language Acquisition: Foundations for Teaching, Testing and Research. Cambridge: Cambridge University Press.
Fernández Pinto, J. 1999. Servicios telemáticos: impacto en función de sus características. Cuadernos Cervantes, 23. Madrid.
Giovannini, A., E. Martín Peris, M. Rodríguez and T. Simón. 1996. El proceso de aprendizaje. Madrid: Edelsa.
Holec, H. 1979. Autonomie et apprentissage des langues étrangères. Council of Europe, Modern Languages Project. Nancy: Hatier.
Soria Pastor, I. and J. M. Luzón Encabo. 2000. Instituto Cervantes. Un sistema de seguimiento inteligente y evaluación tutorizada para la enseñanza y aprendizaje de segundas lenguas a distancia y a través de internet. http://cvc.cervantes.es/obref/formacion_virtual/formacion_continua/ines.htm
Bibliography
Alderson, J. C., D. Wall and C. Clapham. 1998. Exámenes de idiomas: elaboración y evaluación. Madrid: Cambridge University Press.
Council of Europe. Modern Languages: Learning, Teaching, Assessment. A Common European Framework of Reference. Strasbourg. http://culture.coe.fr/lanf/eng/eedu2.4.html (accessed 25/05/01).
Diplomas de Español como Lengua Extranjera. 2001. Examen para la obtención del Diploma Básico de Español como lengua extranjera del Ministerio de Educación y Cultura. D.B.E. Instrucciones para examinadores. Ministerio de Educación, Cultura y Deporte, Instituto Cervantes, Universidad de Salamanca.
14
Certification of knowledge
of the Catalan language
and examiner training
This paper consists of two parts. The first part describes the system of
certificates of knowledge of the Catalan language resulting from the merger of
the two systems in force until 2001.
The second part describes the distance-training system for examiners using
a virtual training environment introduced by the Direcció General de Política
Lingüística (DGPL). This system was used for the first time in 2001 for initial
training and, from 2002 onwards, it has been used for the continuous training
of examiners.
Background
The certificates of the Permanent Board for Catalan were based on those of the
Tribunal Permanent de Català (Permanent Committee for Catalan), a body
created in 1934 by the Generalitat of Catalonia and chaired by Pompeu Fabra
during the Republican Generalitat; the certificates were introduced following
the recovery of the democratic institutions and self-government in 1979.
Since the Catalan language had been prohibited and excluded from the
curriculum, Catalan citizens had been unable to learn this language at school
and were thus unable to demonstrate their knowledge of it with academic
certificates. Once the official nature of Catalonia's language had been
recognised, the certificates of the Permanent Board for Catalan were
introduced to create a system for evaluating knowledge of the Catalan
language for these citizens.
The International Certificate of Catalan was created by a mandate of the
Catalan Parliament in response to rising levels of Catalan language teaching
abroad and the need for accrediting this knowledge; this increase was brought
about by the growth of labour and academic mobility of EU citizens, within
the context of integration, and by increased immigration from non-EU
countries.
Over the last 20 years important social changes have been taking place in
Catalonia, mainly due to three factors: the academic requirement of being able
to use both official languages (Catalan and Spanish) normally and correctly by
the end of compulsory education; legislative changes affecting the legal
treatment of the Catalan language, and, finally, the organisation of
government bodies. Moreover, the experience accumulated over almost two
decades of assessing linguistic competence and the adaptation to new
The environment
The advantages of a virtual training environment for preparing examiners are
as follows:
a) It allows the training of large, geographically dispersed groups of
individuals without their having to travel.
b) The number of hours of training can be increased, because it does away with
the time limits of on-site training (in our case, one day: 7–8 hours of
training), making it possible to provide between 15 and 20 hours of training.
c) It enables examiners in training to work at their own pace (within the set
Training
Training material
Material was distributed on CD-ROM because a video was included and this
would have made downloading from the Internet difficult.
The initial training material for Proficiency certificate examiners, which
involves both oral and written assessment, is described briefly below. For Basic
and Elementary Levels, training is only available for the assessment of oral
expression, because written expression is marked centrally and correctors are
trained in on-site sessions.
The general browser bar for all the material is located in the top part of the
screen. From here, the user can access the following:
Examiners in training
Before starting training, examiners were given two days to familiarise
themselves with the virtual environment, material and communication system.
For training of the group in 2001, the full potential of the virtual
environment as a communication tool was not exploited because the group
was new to this type of system. For example, the Debate area allows group
discussion on the assessment of a sample of written and oral examinations,
which is a common practice in on-site sessions. It could therefore be said that
the training experience carried out in 2001 was a form of individual tutored
learning that did not exploit the possibilities of interaction and group work in
virtual training environments.
For continuous training in 2002, we assumed that the examiners were
already familiar with the virtual environment and the debate element became
the main focus of training in order to adjust the criteria of examiners to those
of the DGPL. The training used guided debates on aspects that the DGPL
considered should be brought to the attention of examiners for reflection and
analysis.
Although both tutors and examiners regarded this option favourably, tutors
had to make a great deal of effort to encourage examiners to take part in the
proposed virtual debates, with varying results. Examiners were informed that
the result of the training depended on their participation in debates and that
they had to take part in each debate at least once to inform the rest of the group
of their opinion on the proposed topics. The overall success of the debate
and the number and quality of contributions varied from group to group; it is
hard to discern the reasons for this variation (tutors and examiners were
allocated to groups at random), but it could be that certain tutors or examiners
had a particular ability to encourage participation (a sense of humour, or the use
of the forum for informal discussion with examiners about cinema, books and
other topics not related to training, can help to break the ice and encourage
participation).
Examiners in continuous training were assessed on the basis of the amount
and quality of their contributions. For example, an examiner who participates
as little as possible (a single contribution per debate) with poorly reasoned
contributions will be awarded a low mark. Depending on the case in question,
the DGPL may deem the candidate to have failed the training and to be thus
unfit to examine.
Tutors
As we said earlier, tutors underwent initial training on learning in virtual
environments. However, their relationship with the project began in the phase
prior to the development of training contents: in order to involve tutors in the
process and to ensure a solid knowledge of the agreed assessment criteria and the
arguments required to provide support, they participated actively in the analysis
and assessment of examination samples to be presented to examiners in training.
The task of these tutors was mainly to guide examiners in training when
applying assessment grading scales and to return their practicals with the
appropriate comments and advice. To allow tutors to do so correctly, a low
tutor/examiner ratio was used: 12–13 examiners in training for every tutor.
Moreover, the technical staff of the DGPL responsible for each exam and for
managing the environment provided support and advice to tutors throughout
the training period regarding any questions or problems that arose. The
ROOMS area of the environment, which acts as a staff room, enables this
support to be given almost instantaneously and ensures unified criteria for
resolving the questions of examiners about training content that were not
covered when the training material was validated, and for resolving problems
about the management of the virtual environment.
When training was completed, tutors had to prepare an assessment report
for each examiner in training on how suitable they would be for the task of
examining. On the basis of this report, the DGPL made its final decision as to
whether or not that individual would be fit to examine.
Table 1 illustrates overall data on the number of individuals involved in this
training system.
2001 421 42
2002 42 236 31
15
CNaVT: A more functional
approach. Principles and
construction of a profile-
related examination system
Introduction
CNaVT (Certificaat Nederlands als Vreemde Taal, the Dutch as a Foreign
Language Certificate) is a government-subsidised, non-profit organisation that
was founded in 1975 and was placed under the auspices of the Nederlandse
Taalunie (NTU) in 1985. Since 1999, CNaVT has been affiliated with the
Centre for Language and Migration (CTM) of the Catholic University of
Leuven (KU Leuven) in Belgium and the University Language Centre (UTN)
at the University of Nijmegen (KUN) in the Netherlands. The NTU asked
Nijmegen and Leuven to develop a more functional certification structure,
with new proficiency tests in Dutch as a foreign language.
In this article we will focus first of all on the paradigm shift that led to this
new certification structure, and on the process of central test construction that
we went through. This construction process started with a needs analysis,
followed by a discussion on profile selection. We then present the selected
profiles that formed the basis of the new certification structure that is
presented in the following section. In this section, we also clarify the
relationship of the new certification structure with the development of a test
bank, which was another part of the NTU assignment. In the final section we
present the way the exact content and difficulty of the profiles is described.
For more information on the CNaVT examinations, please refer to the
websites of ALTE (www.alte.org/members/dutch) and CNaVT
(www.cnavt.org).
A paradigm shift
Learners aspire to a command of the Dutch language for a variety of reasons.
They might be going to work for a company where Dutch is the official
language; they may have family or friends in Belgium or the Netherlands, or
they may want to understand legal documents in Dutch. In short, most learners
want to learn and acquire a command of the Dutch language in order to be able
to function in certain areas of society. This implies that not every individual
has the same language needs, and thus diverse needs can be observed.
In many cases the individual and/or society needs proof of the desired or
required language proficiency (Humblet and van Avermaet 1995). Certificates
are relevant when a clearly defined form of language proficiency is a
prerequisite for admittance to a part of society or when it needs to be
established that somebody can function in a certain cluster of situations. This
diversity, variation and need for contextualisation has to be taken into account
when developing a new certification structure.
The old Dutch as a Foreign Language examination system did not start
from this perspective. It can be presented as a vertical progressive model
testing general language proficiency (see Figure 1). A distinction was made
between the examination levels, but not between the contexts in which
candidates wanted to function. The exams did not take into account the
language variation that can be observed in different communicative situations.
It was a vertical progressive system in the sense that many candidates first
took the lowest level (EK) and then often climbed the ladder to the highest,
paying for every rung on the ladder.
Figure 1: The old vertical progressive model, a ladder of levels: EK (elementary level), BK (basic level) and UK (advanced level).
The new CNaVT is intended to reflect the needs of the target group more
closely. The new certificate should demonstrate that a sufficient command of
the language has been acquired in order for the learner to function in the
situations and domains in which he or she wishes to use the Dutch language.
To have a command of the language means that one is able to use it for all
kinds of communicative purposes. In other words: the exams will have to test
Figure: The paradigm shift, towards functional, contextualised, profile-related and needs-related language proficiency, and the resulting construction process: needs analysis, then profile description (minimum outcomes, related to the CEFR and ALTE levels), then test construction.
Needs analysis
All exams in the new examination system should have accreditation: a
recognition of the social relevance of the exam and a guarantee to both the
individual and society. This implies that the relation between what is tested
and the areas in which one wants to function is very important. The decisive
factor in gaining this accreditation is the degree to which the constructs
assessed in the examinations match the existing language proficiency needs.
The first step that had to be taken was to gain insight into the candidates'
needs. We tried to assess the needs of students of Dutch all over the world
through a written questionnaire that was sent to a representative sample of
students and teachers of Dutch as a foreign language. A part of the
questionnaire consisted of spheres in which the student, using the Dutch
language, might want to function. Examples were 'law study', 'work in
business' or 'living in the Netherlands or Flanders'. In addition, space was
offered for specific extra spheres. A second part of the questionnaire contained
a list of 30 concrete situations in which students might want to use the Dutch
language after they had completed their course. This list was not exhaustive,
but consisted of a range of diverse situations within different domains.
Teachers were asked to check based on their experiences whether these
domains and situations were important, less important or unimportant for their
students. The students were asked to do the same.
The analysis showed that the questions relating to situations had the most
interesting results. We will therefore only present the results concerning the
situations. They were classified by means of principal factor analysis, in order
to detect underlying concepts. We used standard procedures such as the Scree
test to perform these analyses.
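As a rough illustration of the kind of analysis named here (not the authors' actual computation), one can inspect the eigenvalues of the item correlation matrix for the scree 'elbow' and then fit a factor model. The data below are simulated; scikit-learn's maximum-likelihood FactorAnalysis stands in for the principal factor method mentioned in the text:

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

# Hypothetical respondents-by-situations matrix: 200 respondents rating
# 30 situations as unimportant (0), less important (1) or important (2).
rng = np.random.default_rng(0)
ratings = rng.integers(0, 3, size=(200, 30)).astype(float)

# Scree test: inspect the eigenvalues of the item correlation matrix,
# largest first, and look for the 'elbow' suggesting how many factors to keep.
eigvals = np.linalg.eigvalsh(np.corrcoef(ratings, rowvar=False))[::-1]
print("leading eigenvalues:", np.round(eigvals[:6], 2))

# Fit the number of factors the scree test suggests (four, in the paper),
# then read the loadings to name the underlying dimensions.
fa = FactorAnalysis(n_components=4).fit(ratings)
loadings = fa.components_.T  # situations x factors
print("situations loading most on factor 1:",
      np.argsort(-np.abs(loadings[:, 0]))[:5])
```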
The analyses of the teachers' data resulted in four separate dimensions that
were preliminarily designated as 'Business contacts' (clustering situations
such as having a business meeting, making a business telephone call or
writing an article), 'Social contacts' (e.g. buying in a shop, calling relatives,
contacting school teachers, following Dutch broadcasting), 'Study' (e.g.
taking an exam, following a course) and 'Tourism' (e.g. making a hotel
reservation, being shown around, reading tourist brochures). 'Social contacts'
and 'Tourism' were perceived as the most important areas. The outcomes for
the students were partly in line with the teachers' data. Students made no
distinction between 'Social contacts' and 'Tourism', but they too perceived
these to be the main areas in which they intended to use Dutch, again
followed by 'Study' and 'Business contacts'.
Four profiles
The above quantitative and qualitative analyses led to the selection of the four
profiles presented in Table 1. The acronyms do not match the English labels
but rather the official Dutch labels.
After the selection of these profiles was completed, it was clear that they
were very much in line with the domains identified in the Common European
Framework of Reference for Languages (Council of Europe 2001) and by the
Association of Language Testers in Europe (2001), namely social/tourist,
work and study.
On the other hand, during their language course, learners (and their
teachers) would like to know where they stand and what they have learned.
Therefore the construction of a database of tests, which was the second part
of the assignment by the Dutch Language Union (NTU), is important.
The test bank is intended to be a service for the teachers of Dutch as a
Foreign Language. Its aim is to make an inventory of existing tests, to make
them available to teachers (by means of a web-based search system) and to
stimulate teachers to exchange their tests. Three different types of test will be
put in the bank. In the first place, teacher-made tests will be included; these
have been developed by the teachers themselves and are often used in
practice. In addition to these, there will be space for recognised tests, such as
the old CNaVT exams. Thirdly, the project team is taking the initiative to
develop, or to stimulate and supervise the development of, tests that are
otherwise missing from the test bank.
The test bank has a specific and very important goal that complements the
centrally administered official CNaVT examinations. The tests in the bank are
not aimed at the certification of a final attainment level and have no
accreditation. Their aim is rather to guide the learning process. They offer
teachers the possibility of assessing the level of their students' language
proficiency on their way to the Dutch as a Foreign Language exams. Therefore
there is specific provision for tests situated at the in-between levels.
The tests in the bank will be described according to a number of general
parameters, such as level, profile and tested skills (reading, speaking,
grammar). They will be related to the CEFR and ALTE levels where
possible.
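In essence, a web-based search over such parameters is a filter over records like the following; the class, field names and example values are invented for illustration:

```python
from dataclasses import dataclass

@dataclass
class BankedTest:
    name: str
    origin: str    # "teacher-made", "recognised" or "project-developed"
    profile: str   # one of the four CNaVT profiles
    level: str     # CEFR/ALTE level, where one could be established
    skills: tuple  # e.g. ("reading", "speaking", "grammar")

def search(bank, **criteria):
    """Filter the bank on any combination of parameters,
    e.g. search(bank, level="B1") or search(bank, skills="reading")."""
    def matches(test, key, value):
        attr = getattr(test, key)
        return value in attr if isinstance(attr, tuple) else attr == value
    return [t for t in bank
            if all(matches(t, k, v) for k, v in criteria.items())]
```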
Profile description
The determination of final attainment levels or outcomes is a necessary next
phase in the development of profile tests. The outcomes describe what people
should be able to do with Dutch and at which level in order to function
within a certain profile.
A first step in this phase was to make an inventory of the language-use
situations that are relevant for each profile. We took as a starting point the
situations that were in the needs analysis questionnaire. A second step was to
look at the different possible language tasks people have to be able to fulfil in
these language-use situations. For this, inspiration was found in Coumou et al.
(1987).
In the enormous list of language tasks that resulted, a large amount of
overlap could be observed. The language tasks that showed overlap were then
clustered.
The next step was the description of the exact difficulty of each of the
selected language tasks. For this we used a set of parameters that were inspired
An example task description: for 'Reading a tourist brochure', the outcome is to understand and select relevant data from informative texts.
1. LISTENING
Input
understanding requests, wishes, complaints, ...: descriptive; informative, persuasive; unknown/known; informal/formal
understanding instructions: descriptive; prescriptive; unknown/known; informal/formal
understanding messages: descriptive; informative, persuasive; unknown/known; informal/formal
understanding occasional expressions: descriptive; informative; unknown/known; informal/formal
The candidate can determine the main message in requests, wishes or complaints (e.g. a
request made by a hotel owner to make less noise during the evening hours).
The candidate can determine the most important information in instructions (e.g. instructions
given by a police officer during parking).
The candidate can select relevant information from messages heard in everyday situations
(e.g. a route description, a personal description, a guided tour).
The candidate can recognise the most common conversational routines that arise at certain
occasions and react to them appropriately (e.g. birthday congratulations).
Only after the detailed description of the four profiles is completed will a
solid comparison be possible with the Common European Framework of
Reference (CEFR 2001) and with the levels that were described by the
Association of Language Testers in Europe (ALTE 2001). Figure 4
preliminarily situates the CNaVT profiles within the CEFR level framework,
based on the profile descriptions in their current state of development.
Figure 4: The CEFR levels, from A1 up to C2, within which the CNaVT profiles are preliminarily situated.
References
Association of Language Testers in Europe. 2001. The ALTE Framework: A Common European Level System. Cambridge: Cambridge University Press.
Coördinatie-eenheid Prove (ed.) 1996. Eindtermen Educatie. Amersfoort: Prove.
Council of Europe, Modern Languages Division. 2001. Common European Framework of Reference for Languages: Learning, Teaching, Assessment. Cambridge: Cambridge University Press.
Coumou, W., et al. 1987. Over de Drempel naar Sociale Redzaamheid. Utrecht: Nederlands Centrum Buitenlanders.
Cucchiarini, C. and K. Jaspaert. 1996. Tien voor taal? Toetsen van taalvaardigheid. In VON-Werkgroep NT2 (eds.) Taalcahiers: Taakgericht onderwijs: een taalonmogelijke taak? Antwerpen: Plantyn.
Dienst voor Onderwijsontwikkeling. 2001. Adult Education: Modern Languages Training Profiles. Brussels: Ministry of the Flemish Community, Department of Educational Development.
Humblet, I. and P. van Avermaet. 1995. De tolerantie van Vlamingen ten aanzien van het Nederlands van niet-Nederlandstaligen. In E. Huls and J. Klatter-Folmer (eds.) Artikelen van de Tweede Sociolinguïstische Conferentie. Delft: Eburon.
16
Language tests in Basque
Nicholas Gardner
Department of Culture/Kultura Saila
Government of the Basque Country/Eusko Jaurlaritza
Spain
Figure 1: Map of the Basque provinces (Lapurdi, Lower Navarre, Zuberoa, Biscay, Gipuzkoa, Araba and Navarre) and the Bay of Biscay.
Figure 2: Navarre.
Learners of Basque
Who learns Basque? Both native and non-native speakers. Most young native
speakers are now schooled in Basque. However, as in many less-used
languages, the term 'native speaker' is not always very helpful: such people
have a very variable degree of command of the language, ranging from a
literate, educated standard, through competent oral command with limited
literacy skills, to minimal oral command (competent, say, in the home and with
friends, but with major difficulties in any more formal register, and with
limited reading ability). In addition, there are many second-language learners,
particularly from Spanish-speaking families. In any given year over 300,000
schoolchildren now receive Basque language lessons and a fair proportion also
receive part or all of their education through the medium of Basque. To this
total must be added a rather more modest number of university students
studying at least part of their degree course through the medium of Basque and
around 45,000 adults receiving Basque lessons.
Motives are varied, but can conveniently be summarised under two
headings. For some, the objective is developing and obtaining proof of one's
Basque language ability as a matter of pride in a cherished language; for others
the need for Basque is more instrumental, closely linked to the belief
(sometimes more imagined than real) that knowledge of Basque will improve
their options in the job market.
A number of examinations are available to candidates: all have to cater for
all comers, but the EGA examination run by the Basque Government is by far
the most popular at its level (C1/ALTE level 5). The following graph shows
enrolments and passes since its creation.
The rapid growth in enrolments in the early years led to continual
organisational difficulties and, in particular, problems in training a sufficient
number of reliable examiners.
The high failure rate suggests the need to develop a lower-level
examination so that weaker candidates can obtain recognition of their
attainments. Some lower-level examinations do exist, but they do not attract
great numbers of candidates, either because enrolment is limited to certain
groups or because of class attendance requirements.
Figure 3: EGA enrolments and passes per year, 1982–2001 (vertical axis: number of students, 0–18,000).
Examination content
With the teaching of Basque being unofficial and with no formal training for
teachers available, it is hardly surprising that knowledge and practice of
language-teaching methods among teachers of Basque under the Franco
regime lagged behind those of teachers of the major European languages.
Even at the end of the 1970s, grammar-translation methodology seems to have
been extremely common, with the first audio-lingual-based textbook
appearing at the end of the decade. Communicative methodology made an
even later appearance. Task-based learning is now popular, at least for the
teaching of Basque to adults.
In line with the grammar-translation tradition, the written paper of the
D titulua could contain more text in Spanish than in Basque; an oral
component seems not always to have been included. The experimental 1982
EGA examination, established as standard the following year, represented a
major departure: the cultural knowledge element was hived off and the
examination focused entirely on the four skills. The oral examination was
made a major permanent feature. After some initial experimentation with
cloze tests and other exercises, which turned out to be unacceptable to
examiners, the examination settled down to what is still its basic
configuration:
Issues pending
From this overview it is evident that much work remains to be done to bring
EGA up to European standards. The examination body has adopted the ALTE
Code of Practice, though this has brought only minor change, as previous
practice was largely along similar lines, and the body has subsequently joined
ALTE itself; the main focus of concern is now the academic aspect of the
examination. The intention is to adapt it fully to the Common European
Framework, which will no doubt mean changes in exercises and marking
systems.
More statistical analysis of items would be desirable, though pre-testing in
such a small society is difficult. Creation of an item-bank would also be
desirable, but this runs against the tradition of rapid publication of past papers
for pedagogical purposes and, increasingly, against the demands of open
government, according to which some maintain that candidates have the right
to see their paper after the examination and to copy any part they wish.
Greater academic back-up would undoubtedly be an advantage, but
appropriate expertise through the medium of the Basque language is not easily
obtainable; expertise from specialists in major languages, while welcome,
tends not to take specific local problems into account and needs adaptation.
17
Measuring and evaluating
competence in Italian as a
foreign language
Aims
This paper aims to give an overview of the assessment of abilities or
competencies in Italian as a foreign language, presenting the history and the
current situation of certification in Italy and stressing the importance of the
certification experience in order to promote a different culture of assessment
and evaluation in the Italian context.
The CELI examinations and qualifications structure will also be described.
Figure 1: The five CELI qualifications, from CELI1 (lowest) to CELI5 (highest).
The number of CELI candidates has increased constantly during the last
eight years and the great majority of our candidates, as shown in the following
table, clearly prefer the June session to the November one.
There are also some considerations relating to the age and gender of our
candidate population: they are mostly quite young, between 18 and 25 years
old, and 80% are women.
The data reported in the above two tables are important not merely from a
statistical or theoretical point of view; they need, in fact, to be taken into
account in the selection of the materials to be used in the exams.
Level descriptions
CELI1 (ALTE level 1; Waystage, A2): Can understand simple sentences and expressions used frequently in areas of immediate need. Can exchange basic information on familiar and routine matters of a concrete type.
CELI2 (ALTE level 2; Threshold, B1): Can deal with most situations related to tourism and travelling. Can produce simple connected texts on familiar topics or on subjects of personal and everyday interest.
CELI3 (ALTE level 3; Vantage, B2): Can understand the main points of concrete or abstract texts. Can interact with native speakers with an acceptable degree of fluency and spontaneity without particular strain. Can explain his/her point of view and express opinions.
CELI4 (ALTE level 4; Effective proficiency, C1): Can understand longer and more complex texts, making inferences. Can express him/herself fluently, spontaneously and effectively. Can produce well connected and structured texts showing control of cohesive devices.
CELI5 (ALTE level 5; Mastery, C2): Can understand virtually every text heard or read. Can reconstruct arguments and events in a coherent and cohesive presentation. Can express him/herself spontaneously, very fluently and precisely. Can understand finer shades of meaning.
Item types
Item types have to be selected first of all in accordance with a linguistic theory
(the definition and description of the construct that we are going to assess
through the performance of candidates in the exams), but also taking into
account practical considerations such as the number of candidates, which is
usually quite high in a certification context.
The rating scales have been formulated by defining descriptors for four
assessment criteria: vocabulary control, grammatical accuracy, socio-linguistic
appropriateness and coherence.
Vocabulary control: command and adequacy of the lexical repertoire (vocabulary) in order to be able to carry out an assigned task.
Grammatical accuracy: knowledge and use of morphosyntactic processes (formal word modifications) and of connecting mechanisms; orthography.
Socio-linguistic appropriateness: capacity to use the language in context, respecting the aim of expressive ability and in connection with the situation and/or argument treated.
Coherence: capacity to produce written texts that indicate thematic continuity and effective expressive ability.
These changes would certainly imply the production of different tasks for
the speaking component of the exam and new preparation and training for both
the interlocutors and the examiners.
Conclusions
We have provided a brief history of language certification in Italy and of the
certification experience of the University for Foreigners in Perugia. Of course
the system formulated in Perugia can be improved, and it will need some
changes in the near future, but nonetheless the certification experience has
made an important contribution both to the knowledge of Italian all over the
world and to the introduction of a new perspective in the field of assessment
and evaluation of competencies in Italian as a Foreign Language.
References
ALTE Handbook of Language Examinations and Examination Systems.
Baldelli, I. (ed.) 1987. La Lingua Italiana nel Mondo: Indagini sulle Motivazioni allo Studio dell'Italiano. Roma: Istituto della Enciclopedia Italiana.
Council of Europe. 2001. Common European Framework of Reference for Languages: Learning, Teaching, Assessment. Cambridge: Cambridge University Press.
Grego Bolli, G. and M. G. Spiti. 1992. Verifica del Grado di Conoscenza dell'Italiano in una Prospettiva di Certificazione: Riflessioni, Proposte, Esperienze, Progetti. Perugia: Edizioni Guerra.
Grego Bolli, G. and M. G. Spiti. 2000. La Verifica delle Competenze Linguistiche: Misurare e Valutare nella Certificazione CELI. Perugia: Edizioni Guerra.
Simone, R. 1989. Il Destino Internazionale dell'Italiano. In Italiano ed Oltre, 4: 105–109.
Weir, C. J. 1993. Understanding and Developing Language Tests. Hemel Hempstead: Prentice Hall.