Topic: an abbreviation system pitfall

slowmaker

an abbreviation system pitfall
« on: March 04, 2010, 02:55:30 PM »
(this is a slightly modified version of an article I put on my web site today; I thought some of the DC folks might find it interesting)
Some Things To Consider When Looking For An Abbreviation/Shorthand System

I've recently had reason to look into shorthand and abbreviation systems purporting to be suitable for American English, and decided to present here an aspect of those systems that I found interesting.

Shorthand systems can be a bit too enthusiastic in their claims of how much effort you will save. From a Rap Lin Rie (Dutton Speedwords-based) site:

"Learn these 10 shorthand words and increase your longhand writing speed by 25%.
the = l; of = d; and = &; to = a; in = i; a = u; that = k; is = e; was = y; he = s"

Another Rap Lin Rie page by the same person adds the following two words (still claiming 25% speed gain for the (now) 12-word list):
for = f, it = t.

So, we have the following list of words, reduced to single letter codes, that are claimed to increase writing speed by 25%:

the
of
and
to
in
a
that
is
was
he
for
it

This claim is not true, as will be shown later in this piece.

I should note that I am not anti-Dutton Speedwords. On the contrary, my wife and I are currently discussing the possibility of adopting Dutton Speedwords as a family 'private language', both for the fun of it and also for the brevity of speech it genuinely does appear to allow (but mostly for the Fun bit :)). Rap Lin Rie just serves as a good representative of the type of claims such systems tend to make.

The generally available word frequency lists can be misleading also, in the sense that would-be shorthand/abbreviation system writers can interpret these lists incorrectly in a couple of different areas. I should stress here that the various collections I am about to mention do not make any claims as to suitability for abbreviation systems. It is others, e.g. amateurs such as myself, who grab the top off these frequency lists and run with them too hastily.

According to a file hosted by the Shimizu English School, using the British National Corpus, these top frequency words account for 25% of written and spoken English:

the
be
of
and
a
in
to (infinitive-mkr)
have
it
to (prep)

These may well account for 25% of the word count, but a (typable) shorthand system should be concerned with the percentage of letters typed, which can tell a different story.

Note that the various slants on the British National Corpus (spoken only, written only, sorted different ways) are available here. However, none of these lists is quite the thing an abbreviation-oriented shorthand system should be directly based on. You have to hack them up a bit, as we will see shortly.

From the Brown corpus (1960's written American English):

the      6.8872%
of       3.5839%
and      2.8401%
to       2.5744%
a        2.2996%
in       2.1010%
that     1.0428%
is       0.9943%
was      0.9661%
He       0.9392%
         -------
TOTAL   24.2286%


Again, there is the word count problem, but the Brown corpus actually has two more strikes against it:

1) It is a bit old.
2) It is based on written material, no spoken word stats at all.

Paul Noll, in his 'Clear English' pages, states that he/they created their own, more modern American English corpus by "taking forty newspapers and magazines and simply sorting the words and then counting the frequency of the words."

The time frame is somewhere between 1989 and 2005, since he mentions that "...the names Paul and George are on the list. This is no doubt due to Pope Paul and President George Bush at the time the list was compiled." You could presumably pin it down further if you wanted to contact the Nolls and ask which President Bush they were referring to, but that time frame is small enough for me (the language hasn't changed that much in 16 years).
Any road, their 'top ten' list follows:

the
of
and
a
to
in
is
you
that
it

This list has no stats and makes no claims other than the frequency order. It is at least modern and American English, but it still has the problem of being written, not spoken word oriented.

Now, let's take these lists and look at them from what I believe to be a more practical viewpoint. I am operating on the assumption that an abbreviation system should be oriented toward note-taking and audio transcription. There are other uses, certainly, but they often have the luxury of waiting to choose alternatives from popup lists (medical notes transcription aids), and that is not something that makes sense for someone trying to keep a smooth, fast flow of audio transcription going.

I believe this is a fair assumption also because many home-grown shorthand systems promote themselves as being great for exactly the situations mentioned above, especially note-taking (in lectures, for instance).

So, spoken word frequency makes sense for modern usage. That part is easy enough to see, I think.

Actual saved-letter-count, however, is not addressed in any of the shorthand/abbreviation systems I have seen. What I mean is that many systems seem to act as if adopting a shortcut for a particular word somehow eliminates all effort/time involved in writing that word. When you substitute 't' for 'the', for instance, you do not magically save three letters for every instance of 'the'. You save two letters for every instance of 'the'. That is obvious enough (once it's pointed out), but I believe it to be one source of misconceptions regarding the true savings provided by a given abbreviation system. The less obvious bit is that less-frequent words may actually realize greater letter-count savings, so 'the' may not really be the number-one choice for greatest potential effort savings (and in fact, it is not, at least in spoken American English).
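
A minimal sketch of that arithmetic, in Python. The counts below are invented purely for illustration (they are not corpus data), and `letters_saved` is a hypothetical helper name, not something from any of the systems discussed:

```python
def letters_saved(word, count, code_len=1):
    """Letters saved by writing a `code_len`-character code instead of
    `word`, summed over `count` occurrences."""
    return max(len(word) - code_len, 0) * count

# Hypothetical counts, made up for illustration only.
hypothetical_counts = {"the": 1000, "that": 700, "a": 900}

savings = {w: letters_saved(w, n) for w, n in hypothetical_counts.items()}

# List the words from biggest to smallest total saving.
for word, saved in sorted(savings.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{word:5} {saved}")
```

With these made-up numbers, 'that' comes out ahead of 'the' despite occurring less often, because each substitution saves three letters instead of two, and 'a' saves nothing at all.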

The other source of misconceptions is less important; it's just the tendency to react (emotionally?) to large words out of proportion to their actual frequency, even taking letter-count-savings into account. By this I mean that someone might put a lot of effort into creating abbreviations for a set of long words they hate typing, even though the stats show that those words simply don't show up often enough to be worth it. However, I will admit that the emotional factor does matter on an individual basis; if it feels like you've saved a lot of effort by creating a shortcut for 'somereallylongwordthatdoesnotactuallyshowupalot', then it may be worthwhile just for the psychological benefit (or for the confidence in spelling). That sort of thing should not be allowed to shape the core of the system, though; it should just be something individuals tack on for themselves.

So, let's look at the previously mentioned lists from the standpoint of American English, spoken word only, with letters-saved-counts based on single letter abbreviations.

My source for this is the American National Corpus frequency data, Release 2, which contained 3862172 words (more than 14276929 characters). An Excel spreadsheet will be attached to this article; the spreadsheet will show the sorting and calculations I performed on the original ANC data.

Percent Savings means 'how much less typing/writing will I have to do if I substitute a single letter for this' in the sense of how many fewer characters, not how many fewer words (your fingers/wrists don't care about words, they care about keypresses/pen strokes).
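
Written out as a formula, this definition is presumably the following. It is a reconstruction from the description above, not copied from the spreadsheet, and the symbols f, ℓ and C are labels introduced here:

```latex
\[
\text{percent savings}(w) \;=\; 100 \times \frac{f(w)\,\bigl(\ell(w) - 1\bigr)}{C}
\]
```

Here f(w) is the number of times word w occurs in the corpus, ℓ(w) is its length in letters, and C is the total character count of the corpus (whether C includes spaces is not stated, so treat that as an open assumption). A one-letter word has ℓ(w) - 1 = 0, which is consistent with 'a' and 'i' showing 0.00 in the tables that follow.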

Dutton   Percent Savings
======   ===============
the           1.71
of            0.46
and           1.78
to            0.59
in            0.34
a             0.00
that          2.17
is            0.24
was           0.45
he            0.11
for           0.33
it            0.69
             -----
              8.87%

8.87% is certainly a respectable figure, well worth memorizing a simple system of 11 single-letter codes (I'm ignoring the 'a' for obvious reasons). However, it means that someone who genuinely expects to get 25% speed increase is bound to be greatly disappointed; the real speed increase is likely to be consonant with the savings in effort, i.e. about a third of the claimed speed increase.
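
As a quick check of that arithmetic, the sketch below simply re-adds the figures from the Dutton table above and compares the total with the claimed 25%. The variable names are arbitrary:

```python
# Percent-savings figures copied from the Dutton table above.
dutton_savings = {
    "the": 1.71, "of": 0.46, "and": 1.78, "to": 0.59, "in": 0.34, "a": 0.00,
    "that": 2.17, "is": 0.24, "was": 0.45, "he": 0.11, "for": 0.33, "it": 0.69,
}
CLAIMED_GAIN = 25.0  # percent, as claimed on the Rap Lin Rie page

total = round(sum(dutton_savings.values()), 2)
ratio = total / CLAIMED_GAIN  # fraction of the claimed gain actually saved

print(f"total savings: {total}% ({ratio:.0%} of the claimed gain)")
```

The ratio lands a little above one third, which is where the "about a third" figure comes from.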



(ANC Spoken, before I re-sorted by letter savings)
ANC     Percent Savings
===     ===============
i            0.00
and          1.78
the          1.71
you          1.47
it           0.69
to           0.59
a            0.00
uh           0.49
of           0.46
yeah         1.07
            -----
             8.26%

8.26% is not as much as one might expect from a system that takes into account even some of the grunts (for verbatim transcriptionists), is it? Yet, that is what I would get if I naively just swiped the top ten.



BNC     Percent Savings
===     ===============
the          1.71
be           0.12
of           0.46
and          1.78
a            0.00
in           0.34
to           0.59
have         0.75
it           0.69
            -----
             6.44%

6.44%; if you based your single-letter abbreviation system on these words, you would get some disappointing results in terms of effort saved.



Brown    Percent Savings
=====    ===============
the           1.71
of            0.46
and           1.78
to            0.59
a             0.00
in            0.34
that          2.17
is            0.24
was           0.45
He            0.11
             -----
              7.85%

7.85% is a far cry from the 24.23% an initial (knee-jerk) viewing of their stats would suggest, isn't it?




Noll    Percent Savings
====    ===============
the          1.71
of           0.46
and          1.78
a            0.00
to           0.59
in           0.34
is           0.24
you          1.47
that         2.17
it           0.69
            -----
             9.45%

9.45% is the highest yet, and a little surprising to me; the implication appears to be that standard American newspaper vocabulary matches spoken word frequency better than any of the other lists.

Again, I know these lists (except Dutton) were not calculated for 'effort saving', but that's my point; a naive usage of the frequency tables available to us can create unrealistic expectations.

Now, let's look at what happens if you sort by the actual typed-letter savings:

ANC (spoken)  Percent Savings
============  ===============
that               2.17
and                1.78
the                1.71
you                1.47
know               1.10
yeah               1.07
they               1.01
have               0.75
it                 0.69
there              0.66
                  -----
                  12.42%

12.42%, a definite winner for American English transcription purposes. Presumably, the same sort of somewhat counter-intuitive results would be obtained for U.K. English by re-sorting the British National Corpus lists the same way.
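
The re-sort itself is easy to sketch in Python. The figures below are only the ones that appear in the tables in this post, so this is not the full ANC ranking (that needs the whole spreadsheet), and `top_by_savings` is a hypothetical helper name:

```python
# Percent-savings figures copied from the tables in this post (ANC spoken data).
percent_savings = {
    "i": 0.00, "and": 1.78, "the": 1.71, "you": 1.47, "it": 0.69, "to": 0.59,
    "a": 0.00, "uh": 0.49, "of": 0.46, "yeah": 1.07, "that": 2.17, "know": 1.10,
    "they": 1.01, "have": 0.75, "there": 0.66, "is": 0.24, "was": 0.45,
    "he": 0.11, "for": 0.33, "in": 0.34, "be": 0.12,
}

def top_by_savings(savings, n=10):
    """Return the n (word, percent) pairs with the largest percent savings."""
    return sorted(savings.items(), key=lambda kv: kv[1], reverse=True)[:n]

top_ten = top_by_savings(percent_savings)
total = sum(percent for _, percent in top_ten)

for word, percent in top_ten:
    print(f"{word:6} {percent:.2f}")
print(f"total  {total:.2f}")
```

Note that the rounded per-word figures add up to about 12.41, a hair under the 12.42 in the table; presumably the spreadsheet total was computed from unrounded values.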

However, what if you aren't doing actual transcription, you're just taking notes in lectures, and therefore you don't have to take down every 'you know', 'yeah', 'and', 'to', etc.? Well, then, your list (indeed, all of the above lists) will look different, and you will get to include the 'magical' saving of a full three letters for 'the', two letters for 'to', and so forth. However, the typed-letter savings sorting strategy still applies; it's just that you have to move further down the sorted list to pick out the words you will bother to abbreviate instead of skipping entirely.
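
A rough sketch of that "move further down the list" step, in Python. The savings figures are the sorted ANC ones from this post; the skip list is an assumption made up for the example (a real note-taker would choose their own), and `abbreviation_candidates` is a hypothetical name:

```python
# Percent-savings figures from the sorted ANC (spoken) table in this post.
percent_savings = {
    "that": 2.17, "and": 1.78, "the": 1.71, "you": 1.47, "know": 1.10,
    "yeah": 1.07, "they": 1.01, "have": 0.75, "it": 0.69, "there": 0.66,
}

# Assumed for illustration: words a note-taker drops entirely.
skipped = {"and", "the", "you", "know", "yeah"}

def abbreviation_candidates(savings, skipped, n):
    """Top-n words by percent savings, ignoring words that are never written."""
    kept = [(w, p) for w, p in savings.items() if w not in skipped]
    return sorted(kept, key=lambda kv: kv[1], reverse=True)[:n]

for word, percent in abbreviation_candidates(percent_savings, skipped, 3):
    print(f"{word:6} {percent:.2f}")
```

With this particular skip list the first three codes would go to 'that', 'they' and 'have'; the skipped words cost nothing to write, so spending a code on them would be a waste.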

A full 24-letter code table using the typed-letter-savings-count approach would offer you a 20% savings in actual effort, not just superficial word count, and that is definitely nothing to sneeze at. So I am currently working on exactly that: a simple 24-letter code, no funky rules, just a straight substitution code that will theoretically save 20% of writing/typing effort.

I'm leaning toward doing both a transcription and a note-taking version. It seems to me that the number of common words that get dropped entirely in note-taking would necessitate a drastically different abbreviation set.

Notes on the ANC data in the spreadsheet:

1) I aggregated some of the data; the original lists each word separately for every part of speech it is used as, which is irrelevant to my purposes. However, I did not aggregate all duplicates of all words, so be careful if you try to use automated tools on this data set; it's not consistent in that respect.

2) Some of the entries are obviously not things you want for a single letter code system, endings of contractions and so forth, but they remain because their *letter count* still matters for the calculation of overall typing effort.
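
For anyone running automated tools over the data, the aggregation described in note 1 could be finished with something like the Python sketch below. The row layout, the part-of-speech tags and the counts are all invented for illustration; the real ANC file and the attached spreadsheet may be laid out differently:

```python
from collections import Counter

# Hypothetical rows in (word, part-of-speech tag, count) form.
# Layout, tags and counts are made up; they are not taken from the ANC data.
rows = [
    ("that", "DT", 120),
    ("that", "IN", 300),
    ("That", "WDT", 80),
    ("know", "VB", 150),
    ("know", "VBP", 250),
]

def aggregate_counts(rows):
    """Sum counts per word, ignoring the part-of-speech tag and letter case."""
    totals = Counter()
    for word, _tag, count in rows:
        totals[word.lower()] += count
    return totals

totals = aggregate_counts(rows)
for word, count in totals.most_common():
    print(f"{word:6} {count}")
```

Folding case together is a design choice made here for the sketch; keep the original casing if capitalised entries should stay separate.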

Attached is the ANC spoken word data in a spreadsheet (compressed to 1M with 7zip, expands to ~13M). The spreadsheet is large enough to crash my copy of Open Office Calc repeatedly (about every fifth operation), so I had to use Excel. Sometimes, Evil sucks less...
WinXP Home SP3 - PSPad 4.5.4
« Last Edit: March 04, 2010, 02:58:47 PM by slowmaker »

Stephen66515

Re: an abbreviation system pitfall
« Reply #1 on: March 05, 2010, 07:40:37 AM »
Great post, even though I got bored 1/2 way down due to the sheer amount of data lol.

I think the best short-hand available at the moment is 'txt lang' which is widely used, and very widely understood.

u wud nt av a prob readin dis if i typed lyk it wud ya

-Stephen