Hello, and questions (long!)

Sari Elina Luoma (sluoma@tukki.jyu.fi)
Tue, 22 Feb 1994 20:08:19 +0200 (EET)

Hello all! I noticed in the past logs that you would like to have
introductions from new members of the list. If I didn't have questions
and problems interpreting my data, I might have remained silent longer,
but as it is, three days of lurking will have to suffice: I will occupy
your screens for a
while. A longish while, I should warn. Those of you who are busy
right now might want to stop at this point. Some of you have already
seen some screenfuls of this through your private addresses - for you:
sorry for the repetition.

The world of PCP and grids is new to me, as is the world of psychological
theory in general; I am an applied linguist. Last summer I heard a paper
titled "What oral raters really pay attention to" where the grid was one
of the methods used, and as we do not have very many effective ways of
analysing what assessors are doing when they assess language test
performance, I was very interested. Since then I have read some books,
tried to understand what kinds of questions I can answer by employing a
grid approach, and how I should set up an experiment in order to do
so. I did set up an experiment, and I ran into some problems in trying
to interpret the results.

The assessment analysis where I'm using the rating grid (1-7) is part of
a small-scale study to examine whether the tape-mediated test of speaking
we are now using as part of a language proficiency test battery is good
enough as it is, or whether we should include a face-to-face subtest as
well. The most general question I am asking in the assessment analysis
is what are the assessors paying attention to, or, in more elevated
terms, how do the assessors construe the task of assessing oral language
test performance. The more specific question, methodologically an even
more difficult one at least for me, is whether the features assessed are
essentially the same, or whether there are differences in assessment foci
between the two tests. So far, all I can do is compare labels. It's
better than nothing, but...

Three grids have now been completed by two raters. Two concerned the
face-to-face test (12 and 15 performance extracts used as elements), and
the third the tape-mediated test (12 performance extracts as elements).
The elicitation question series used was: On which important assessment
criterion do two of these performances differ from the third? How can
you tell (refer to concrete features in the performances)? How would you
rate these three extracts on this feature on a 7-point scale? Describe a
level 1 and a level 7 performance on this feature. Then the rest of the
elements are rated on this feature, and we move on to the next triad and
the next feature.
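
(For the concretely minded: one elicitation round produces a record that
looks roughly like the Python sketch below. The names and ratings are
invented for illustration; this is not code we actually ran.)

    # One elicitation round, recorded as plain data. All names and
    # numbers here are invented for illustration.

    triad = ("extract_03", "extract_07", "extract_11")  # the three performances compared

    construct = {
        "label":  "rate of speaking",  # criterion on which two differ from the third
        "pole_1": "halting; long pauses and many restarts",  # a level-1 performance
        "pole_7": "fluent, effortless tempo",                # a level-7 performance
    }

    # Every element is then rated on the new construct (1-7):
    ratings = {
        "extract_01": 4, "extract_02": 6, "extract_03": 2,
        "extract_04": 5, "extract_05": 3, "extract_06": 7,
        # ... and so on for all 12 extracts
    }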

First problem: the size and shape of the resulting grids. I cannot get
square grids - there simply don't appear to be more than eight (at the
most) important assessment features that can be considered distinct from
each other.
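
To make the shape concrete: such a grid is simply a constructs-by-elements
matrix, say 8 rows by 12 columns, and the usual computations run along one
direction or the other. A rough sketch, assuming numpy and scipy are
available, with invented numbers:

    import numpy as np
    from scipy.spatial.distance import pdist, squareform

    rng = np.random.default_rng(0)

    # 8 constructs (rows) by 12 elements (columns), ratings 1-7.
    # Invented data, only to show the rectangular shape.
    grid = rng.integers(1, 8, size=(8, 12))

    # Construct-construct correlations, computed across the 12 elements:
    # an 8 x 8 matrix, one coefficient per pair of rows.
    construct_corr = np.corrcoef(grid)        # shape (8, 8)

    # Element-element distances, computed across the 8 constructs:
    # a 12 x 12 matrix, one value per pair of columns.
    element_dist = squareform(pdist(grid.T))  # shape (12, 12)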

Second problem: I have these interesting-looking grids, the assessors have
gone over them again and verified that this is indeed what they feel, and
found explanations in them for the qualms they had regarding the
(previously made) holistic assessment of some of the performances. But I
don't know how to analyse them. I have run correlations; they are mostly
very high, .7 to .9, with some individual ones in the .5 range. Naturally,
factor analysis produces one factor. This should mean there's only one
construct operating. In that case, I would call this construct language
ability, rather than any of the individual labels elicited. But what of
the labels, should I say they don't mean anything? But they do! They
are very real to the assessors, and clearly separable, rate of speaking
for instance, from extent of vocabulary, which is again different from
structural knowledge. It just so happens that language is an integrated
skill: as ability grows, all of these features improve, some more
linearly than others. But this doesn't mean the features don't exist,
not to me at least. Will I ever be able to do more than label
comparisons?
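
To show what I mean numerically, here is roughly the computation, as a
sketch on invented data built to mimic my pattern (one underlying
ability dimension plus noise; again assuming numpy, not my real grids):

    import numpy as np

    rng = np.random.default_rng(1)

    # Invented 8 x 12 grid behaving the way my data seems to:
    # one underlying ability dimension that all constructs follow,
    # plus a little construct-specific noise.
    ability = rng.uniform(1, 7, size=12)
    grid = np.clip(ability + rng.normal(0, 0.7, size=(8, 12)), 1, 7)

    # Inter-construct correlations across elements: with these
    # settings, mostly in the .7-.9 range as well.
    r = np.corrcoef(grid)

    # Principal components of the correlation matrix: the first
    # eigenvalue swallows most of the variance -- the "one factor".
    eigvals = np.linalg.eigvalsh(r)[::-1]  # descending order
    explained = eigvals / eigvals.sum()
    print(explained.round(2))              # first share is dominant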

I have Mancuso & Jaccard's PAREP and SELFGRID, but they will not
work with my present setup without substantial modifications, and I am in
the process of acquiring Metzler's PCGRID. For now, all I have is SPSS
for Windows.

I don't mean to underestimate the value (of sorts) of label comparisons:
qualitatively, the whole process was very rewarding to the assessors, and
comparing the labels helps develop interesting hypotheses about language
test performance. Some almost self-evident differences were detected; we
just hadn't thought of them before in that way. But where's the proof, I
will be asked. And, as yet, I cannot answer that.

If anyone has any thoughts on this, I would much appreciate hearing
them. Thank you for your time.

Sari

Sari Luoma tel: +358 - 41 - 603 531
Language Ctr for Finnish Univ. fax: +358 - 41 - 603 521
P.O. Box 35, 40351 Jyvaskyla, FINLAND e-mail: sluoma@tukki.jyu.fi