Author Topic: Programming 101 Lesson: Don't Purge User Data  (Read 9728 times)
mouser
« on: September 30, 2007, 08:43:28 PM »

Here's a lesson to the programmers out there on the importance of preserving user account data, even when you are sick of them.

So today I went to buy someone a gift certificate at newegg.com, which for some ungodly reason will only let you use Visa credit cards to purchase said gift certificates (and not MasterCard -- is there some kind of prejudice against credit card issuers I'm not aware of?).

That's ok, I have a Citibank MasterCard and a Chase Visa credit card (I'm not altogether unfamiliar with the concept of debt).

So I place my order and it is rejected.  I go online to check my Chase account to see what's up.

My card account is gone.

Weird.

Ok, so I call up Chase, whose automated system wants me to type in my account number.  I do so and it complains and tells me I must have typed it in wrong.  After smashing my palm on the keypad for a few seconds I am transferred to an operator.

The operator tells me my account doesn't exist.  Am I sure I had an account with them?  Only since 1997, I answer.  "Well, we aren't showing any such account."  Then they do what they normally do, which is basically wait for you to say thank you, go about your life, and stop bothering them.

I surprise the operator by explaining that I'd kind of like to know what has gone wrong.

I spent the next 40 minutes being transferred to different supervisors, who all made me recite the same litany of stats about my card: my SSN, the name of my first dog, my mother's maiden name and her favorite color.

No one could find that my account ever existed in the first place.  Basically I had been erased and disappeared from the history books as if I were a political prisoner.

The best guess anyone could come up with is that maybe I didn't use the card enough and kept paying off my balance every month, so they canceled the account and *purged* its existence from their records.  In the end they just gave up, and so did I.  I'll just try to drop my memory of this card into the memory hole and hope it never comes back.

You might think the lesson from this story is that credit card companies are greedy evil vultures.  But that is not the lesson.  If you don't know that already then you need more remedial education.

The lesson from this story to programmers is this:

You do *not* purge user data.  Ever.  There is no scenario where drive space is so valuable that you need to purge user accounts.

If you want to disable, deactivate, terminate, cancel -- whatever -- an item or account in your database, you need to build support for doing this into your software system.  Always plan for items to be marked as disabled: no longer treated as valid or in use, but still retained in your database.

You will need to be able to look up things that were once active and are no longer.  Whether it's a user inquiry, a situation where you need to undo your action, or a need to recover old information for historical record keeping -- never delete accounts with information in them.

In general you should follow this rule whenever possible, not just with user accounts but with data of all sorts.  That means building a kind of recycle bin into your databases.  Note also that *moving* something to the recycle bin should not involve losing data from that item; i.e., you must preserve where the item came from when you move it to the recycled state.
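The "recycle bin" idea above amounts to a soft-delete flag. Here's a minimal sketch using Python's sqlite3 (the table and column names are invented for illustration, not taken from any system in this thread):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE accounts (
        id        INTEGER PRIMARY KEY,
        username  TEXT NOT NULL,
        is_active INTEGER NOT NULL DEFAULT 1,  -- 1 = active, 0 = disabled
        closed_at TEXT                         -- when the account was disabled
    )
""")
conn.execute("INSERT INTO accounts (username) VALUES ('mouser')")

# "Cancel" the account: flip the flag instead of issuing DELETE.
conn.execute("""
    UPDATE accounts
    SET is_active = 0, closed_at = datetime('now')
    WHERE username = 'mouser'
""")

# Day-to-day queries see only active accounts...
active = conn.execute(
    "SELECT username FROM accounts WHERE is_active = 1").fetchall()

# ...but support staff can still find the closed account years later.
closed = conn.execute(
    "SELECT username, closed_at FROM accounts WHERE is_active = 0").fetchall()
```

The point is that every query in the system has to distinguish active from disabled rows, which is why this has to be designed in from the start rather than bolted on.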

« Last Edit: September 30, 2007, 08:45:49 PM by mouser »
lanux128
« Reply #1 on: October 01, 2007, 12:27:15 AM »

I remember that last month my wife's internet banking account was "discontinued" due to "non-active" usage. Since we were ill-equipped to deal with the bureaucracy, we decided to bypass the whole re-activation process altogether and got another account at the same bank.

cranioscopical
« Reply #2 on: October 01, 2007, 12:43:59 AM »

A bit off-topic but...

I have a friend with a very significant sum on deposit in one bank account. My friend has made no transaction through this account for slightly over a year. The bank is making regular interest deposits, however, which is why the money was lodged there in the first place.

The bank just made a good attempt to declare the account dormant and scoop the funds. The bank states that the fact that it is making interest deposits to the account doesn't count as 'activity'.

Financial institutions seem to be making some very questionable decisions of late, with potential outcomes always in the banks' favour.

Chris
Grorgy
« Reply #3 on: October 01, 2007, 12:47:30 AM »

And banks wonder why we are cynical about their motives  Grin
OGroeger
« Reply #4 on: October 01, 2007, 03:04:30 AM »

Mouser, I respect your anger and suspect you are really frustrated, but I don't see your conclusion. User data should be purged after some term, for several reasons:
  • Some day you will choke on the sheer amount of data.
  • After some term the data is useless, e.g. once all burdens of proof and claims have expired.
  • Data protection: in some cases, I think, you are required to delete the data.

I worked for the last few years on a product information management system for huge companies (several million products), and my conclusion about these companies' needs is that they should not store as much data as possible. That is what they did in the past. Now they have billions of records; they know that 90 percent are outdated and worthless, but they don't know how to find them. The challenge today is not saving data but keeping data consistent.

The scandal with your account is that they removed it silently, without telling you and without giving you a chance to prevent it.
Logged
Grorgy
« Reply #5 on: October 01, 2007, 03:24:23 AM »

I worked for a telecommunications company, and data preservation and the ability to find it were of great importance.  It could not be left on the day-to-day databases, so separate archive databases with the relevant information were set up.  Depending on the age of the data you were after, it could be accessed almost immediately or, at worst, overnight.  Prior to computerization it was all done with microfiche -- clumsy, a storage problem of its own, and not hugely secure, but it worked, mostly.

Product information would be another matter, but where the customer is still alive and you are dealing with their money, it seems incumbent upon the people in control of these resources to keep the data secure.
mouser
« Reply #6 on: October 01, 2007, 09:11:20 AM »

OGroeger, in the past it may have been true that disk storage was expensive and slow enough that deleting data was something people worried about.

It seems to me that this should really no longer be an issue, and that purging should be considered harmful.  I'm not talking about keeping terabytes of data on each user, and I'm not talking about refusing to delete sensitive security information that should be purged.

But I am saying that if you are writing a software system, and think to yourself "we can save 10% of the database size if we purge old accounts" then take a deep breath and don't do it.

In fact, I'd venture a rule of thumb that goes something like this: until you are talking about growing the database to 1000% of its normal size, I would not even consider purging items from it, and at that point you would do well to simply *move* such inactive accounts to a secondary database rather than eliminate the data.

Perhaps another way to view what I am trying to emphasize from a programming perspective is this:
As programmers when we are writing a system (for example a forum system), we often fail to plan for the idea of *disabled* items.

So imagine you are designing a forum system, which creates user accounts, posts, sections, etc.  There will be times when you need to eliminate such user accounts, posts, and sections.  Now ask yourself: are these actions reversible, preserving all old information?  Do you have a way to disable instead of purge these items?  Are all your functions coded in such a way that they distinguish between "disabled" items and normal items?  Or is your code set up such that you have to permanently purge such items in order to disable them?

Some "defensive" programming practices have become commonplace in modern programming, such as test case generation and source control.  But auditing and logging[1] are much less common.  I'd like to suggest this concept is an important one as well, which might be called "Disabling over Deleting".



[1] Recently I discovered I had made a change that misconfigured the email address to which requests for the silly Cody kits were being sent.  A month had passed before I figured out the form submissions were never being delivered.  I was able to recover all of the submissions because I had built into the script a backup procedure that always logged form submissions to a simple text file at submission time -- not as convenient to parse, but as a redundant backup system it worked perfectly.
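The redundant-logging trick described in that footnote can be sketched roughly like this (the file name and record format here are invented for the example; the original script's details aren't given):

```python
import json
import tempfile
import time
from pathlib import Path

# Hypothetical log location; a real script would pick a stable path.
LOG_FILE = Path(tempfile.gettempdir()) / "submissions.log"

def log_submission(form_data: dict) -> None:
    """Append every submission to a flat text file *before* any other
    processing, so a misconfigured mail step can't lose the data."""
    record = {"ts": time.time(), "data": form_data}
    with LOG_FILE.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")

log_submission({"name": "Cody", "address": "123 Acorn Lane"})

# Recovery is just re-parsing the file, one JSON record per line.
recovered = [json.loads(line) for line in LOG_FILE.read_text().splitlines()]
```

Append-only text isn't convenient to query, but as a last-resort backup it's nearly impossible to break, which is exactly the property mouser is describing.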
« Last Edit: October 01, 2007, 09:15:29 AM by mouser »
OGroeger
« Reply #7 on: October 01, 2007, 12:37:17 PM »

Mouser, I think we are still talking about huge systems (the user data of a credit card system), and believe me, your arguments are wrong at this scale. These systems are planned for project durations of several, often ten, years. Designing a system that just keeps growing for ten years will definitely not work -- and not because of the cost of disk storage. Hardware costs are negligible in these projects; say you need a good server for $50,000. That is nothing compared to the other costs.

The main reason this doesn't work is time. Because your database is constantly growing, its performance constantly degrades. The query that took 0.1 ms in the beginning now takes 1 ms. The mouse click in the client that took 1 second in the beginning now takes 30 seconds. At this point the big manager will call you and present the following calculation: in the beginning, 500 people did a particular amount of work. 500 people cost $17.5 million a year in salary. Now they can finish only half the work because the application slows down day by day, so the same amount of work now costs $35 million! I assure you, this manager won't be grateful that you offered him the possibility of resurrecting the data of a user who died five years ago.

Other problems follow, e.g. the time for making backups grows from 5 minutes to 2 hours. This can be problematic when the company wants to make cold database backups (the database must be shut down), because the downtime is lost time.

Please don't misunderstand me. I'm on your side regarding 99% of the applications in this world. It's great to have undo and redo, and it's ok to disable a *reasonable* amount of data, but you can't make this a global programming rule.

If I couldn't convince you, I suggest an experiment to simulate "Disabling over Deleting": stop removing files and directories, move them to the recycle bin instead for one year, and we'll discuss the results in October 2008.
f0dder
« Reply #8 on: October 01, 2007, 06:19:30 PM »

Pft., as if a credit card customer database needs to be very large per customer... bullshit. And do databases really get consistently slower as client info is added?

- carpe noctem
Target
« Reply #9 on: October 01, 2007, 10:09:07 PM »

Seems to me there's two parts to this...

  1. The requirement for data retention.  The amount and type of personal info that could/should/would be retained depends on the data and the user/use; however, I can't think of a single instance where someone might not make a reasonable claim on that data (under FOI, audit requirements, legal actions, etc.)
 
  2. The establishment and maintenance of a sensible data management process, i.e. live storage limits, archive timetables, etc.

In line with 1 above, it would seem appropriate for any business to retain all records of 'recent' transactions in a live database.  Most (big?) businesses analyse this information as a marketing resource if nothing else.  Obviously there are limits to the currency of any information, which is where 2 comes in; however, there should always be the option to recover data deemed no longer current.

Surely developers of a database of this type should be conscious of and make allowances for these requirements -- a very sweeping generalisation, but it's just good practice.

The amusing thing with mouser's situation is that these institutions will aggressively seek your business via mailouts, telemarketing, advertising, etc., yet they don't appear to value your business enough to retain either it or your records...

Another issue here (slightly off topic, I suspect) is what happens to his credit rating?  If this had been his only credit record (and a very clean one, it would seem) and there is no longer any record of it, does this not place him at a disadvantage?

"Look wise, say nothing, and grunt. Speech was given to conceal thought" - Sir William Osler
app103
« Reply #10 on: October 02, 2007, 01:28:29 AM »

Purging old data...and banks...

You can almost count on it: if he had defaulted on his account and it wasn't in good standing, his info wouldn't have been lost. They would never have purged the data related to THAT kind of closed/inactive account. They would keep that forever.

Tekzel
« Reply #11 on: October 02, 2007, 07:33:31 AM »

I am from the BOFH school of thought:  Users are useless, and by association, so is their data.  Delete it randomly and without warning just to see them cry.
Ralf Maximus
« Reply #12 on: October 02, 2007, 09:37:27 AM »

Recently encountered a scenario similar to this (massive database of live data, >90% "too old" to matter).  The solution we came up with:

1. Keep data that's routinely accessed in the "live" database.

2. Regularly archive older data to big, slow servers but still retain access from the main app.  If a user requests data not in the live system, a secondary query is performed in the archives.  If data is found there, it's reactivated -- moved back to live storage.

This was just implemented so I don't know how it will fare over the long haul, but so far so good.  There's a noticeable lag when the archives are hit but in this case the results are worth it, since performance when accessing live data has improved dramatically.  And users have access to everything, all the time.
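The two-tier lookup described above might be sketched like this, with two in-memory SQLite databases standing in for the live and (big, slow) archive servers. All table and column names are illustrative, not taken from the actual system:

```python
import sqlite3

# Two stores: "live" is fast and small, "archive" is the slow bulk store.
live = sqlite3.connect(":memory:")
archive = sqlite3.connect(":memory:")
for db in (live, archive):
    db.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")

# Pretend this record aged out of the live system long ago.
archive.execute("INSERT INTO customers VALUES (42, 'old customer')")

def fetch_customer(cust_id):
    """Look in the live store first; on a miss, search the archive and
    reactivate the record by moving it back to live storage."""
    row = live.execute(
        "SELECT id, name FROM customers WHERE id = ?", (cust_id,)).fetchone()
    if row:
        return row
    row = archive.execute(
        "SELECT id, name FROM customers WHERE id = ?", (cust_id,)).fetchone()
    if row:
        live.execute("INSERT INTO customers VALUES (?, ?)", row)
        archive.execute("DELETE FROM customers WHERE id = ?", (cust_id,))
    return row

first = fetch_customer(42)   # slow path: hits the archive, reactivates
second = fetch_customer(42)  # fast path: now served from live storage
```

This matches the behavior Ralf describes: the first access to old data pays an archive-lookup penalty, and subsequent accesses are fast because the record has been promoted back to the live store.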
OGroeger
« Reply #13 on: October 02, 2007, 03:02:16 PM »

Wow, that sounds interesting. How do you determine that data is "old" and should be moved to the archive?
Ralf Maximus
« Reply #14 on: October 03, 2007, 09:49:38 AM »

Every record has a "Date Created" and a "Date Updated" field, standard MS date types.  That makes aging records simple.  For a 90-day cut-off (Access syntax):

SELECT * FROM Table WHERE DateUpdated < (Now() - 90)

...or, SQL Server syntax:

SELECT * FROM Table WHERE DateUpdated < DATEADD(day, -90, GETDATE())

:-)
Lashiec
« Reply #15 on: October 03, 2007, 09:07:16 PM »

A simple but very clever idea. I like those.
OGroeger
« Reply #16 on: October 04, 2007, 04:48:26 AM »

How do you handle the situation where a record has been moved to the archive and afterwards a duplicate is inserted in the live DB? I mean:
  • A record is moved to the archive.
  • Someone inserts a new record with the identical business key as the archived record.
  • The new record ages out and is moved to the archive.

I'm sorry to pester you with questions.

 
tinjaw
« Reply #17 on: October 04, 2007, 08:29:05 AM »

  • Someone inserts a new record with the identical business key as the archived record

There is your problem. You should never reuse such a key. Unique primary keys are the heart and soul of any database.

 
Ralf Maximus
« Reply #18 on: October 04, 2007, 08:30:13 AM »

No problem!  Glad to help.

The system design prevents duplicating an archived key in most cases, and in the other cases we kind of shrug in a distracted manner and mutter "don't care".  Multiple duplicate keys aren't a problem, since all the ACTUAL key fields used to relate tables are hidden from the user and remain unchangeable.  We NEVER allow user-manipulated data to link records together, so if somebody deletes the Master Record ID for a chunk of data, all they're doing is erasing a cosmetic item.  The system chugs on.  Likewise, if they change a Customer ID to match another Customer ID, the REAL linkages behind the scenes don't change, but the next time they query for that ID they'll see two items listed in the results.  And BTW, if users are doing this kind of thing, they sort of get what they deserve.

The system was designed with the basic understanding that humans are very clever at entering duplicate data no matter how hard you try to stop them.  We accept that and assume (some) duplication exists, but allow users to clean it up when they want.

Part of it is training, also.  Before creating (say) a new Customer record, the user is strongly urged to perform a search first.  Only when the search fails (against both live and archived data) is it a good idea to create a new Customer record with a unique Customer ID.  And the machine generates all the unique IDs, not the user.

Ah, but what about those diabolical customers who change their names every time they call?  In 2005 when they signed up they used "BRIAN T. DAMAGED", but now when they call back they're "BRIAN DAMAGED".  Next time it might be "B.T. DAMAGED"... and if a user's not careful they might end up with three separate records for the same guy.

Well, fortunately, the system does some pretty comprehensive searches.  Querying for "BRAIN DAMAGER" would still match the original 2005 record.  And all that assumes the user doesn't know the customer's unique Customer ID number, which oddly enough most of them do.

But I digress.

So what would happen in your suggested scenario, where the 2005 "BRIAN T. DAMAGED" has been archived, and some n00b creates a new 2007 "BRIAN T. DAMAGED" despite there already appearing to be a record for the guy?  Despite the thunderous warnings from the system, the hooting alarms, the electrical shocks?

The system happily accepts the duplicate record, and will eventually archive it alongside the 2005 version when it grows old enough.  If a smarter user stumbles across the two records and decides they should become one, it's a menu option to merge them.
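That merge option might look roughly like the following sketch, assuming a child `orders` table and a soft-delete flag (both invented for the example; the actual system's schema isn't described):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT,
                            is_active INTEGER NOT NULL DEFAULT 1);
    CREATE TABLE orders (id INTEGER PRIMARY KEY,
                         customer_id INTEGER REFERENCES customers(id));
    -- The same person entered twice under slightly different names.
    INSERT INTO customers (id, name) VALUES (1, 'BRIAN T. DAMAGED');
    INSERT INTO customers (id, name) VALUES (2, 'BRIAN DAMAGED');
    INSERT INTO orders (id, customer_id) VALUES (10, 1), (11, 2);
""")

def merge_customers(conn, keep_id, dupe_id):
    """Re-point the duplicate's child rows at the surviving record,
    then disable (not delete) the duplicate."""
    conn.execute("UPDATE orders SET customer_id = ? WHERE customer_id = ?",
                 (keep_id, dupe_id))
    conn.execute("UPDATE customers SET is_active = 0 WHERE id = ?",
                 (dupe_id,))

merge_customers(conn, keep_id=1, dupe_id=2)
order_owners = [r[0] for r in conn.execute(
    "SELECT DISTINCT customer_id FROM orders")]
```

Note that the duplicate customer row survives as a disabled record, so the merge is auditable and, if it turns out the two "Brians" were different people after all, reversible.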

Seems to work well, so far.  And most of the above still applied before we implemented the archiving feature.

Does that make sense?
OGroeger
« Reply #19 on: October 04, 2007, 11:00:41 AM »

@tinjaw:

There is your problem. You should never reuse such a key. Unique Primary Keys are the heart and soul of any database.

The emphasis is on "business". A business key is something you can use to distinguish between records and to find and identify them, e.g. a user name when your entity is a user, an ISBN number when you manage books, or a social security number. It is something you would use in the WHERE part of a select statement. You should not use it as the primary key. The primary key should have no business meaning, only a technical one. Best is to let the database manage the primary key; for instance, in MS SQL use an identity column. The background of my question to Ralf Maximus was that I was interested in whether his business key is unique or not.
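The surrogate-key-plus-business-key arrangement described here can be sketched in SQLite, with a partial unique index standing in for whatever constraint a real system would use (all names are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Surrogate key (id) carries no business meaning; the business key
# (username) is a separate, searchable column.
conn.execute("""
    CREATE TABLE users (
        id        INTEGER PRIMARY KEY AUTOINCREMENT,  -- technical key
        username  TEXT NOT NULL,                      -- business key
        is_active INTEGER NOT NULL DEFAULT 1
    )
""")
# Partial unique index: at most ONE active row per username, while
# archived/disabled rows may keep their old business key.
conn.execute("""
    CREATE UNIQUE INDEX one_active_username
    ON users (username) WHERE is_active = 1
""")

conn.execute("INSERT INTO users (username, is_active) VALUES ('brian', 0)")
conn.execute("INSERT INTO users (username) VALUES ('brian')")  # allowed

# A second *active* 'brian' violates the constraint.
try:
    conn.execute("INSERT INTO users (username) VALUES ('brian')")
    duplicate_active_allowed = True
except sqlite3.IntegrityError:
    duplicate_active_allowed = False

ids = [r[0] for r in conn.execute("SELECT id FROM users ORDER BY id")]
```

Because the surrogate `id` never repeats, foreign keys stay unambiguous even when the business key ('brian') appears on both an archived and a live row.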


@Ralf Maximus:
Absolutely yes, this sounds mature and well designed. Thank you for the inside view.
tinjaw
« Reply #20 on: October 04, 2007, 11:26:42 AM »

OGroeger, I am going to have to disagree with you. If you are using it as a key, it should be unique, regardless of whether it is a user name or an ISBN. Detach the key that identifies a *record* in a database from the *data* it holds, even if that data is used as a key in some other system. I explain this as the "Not My Database" problem. By using ISBNs you are relying on somebody else's key system to identify entities, and that locks you into using ISBNs. Use a layer of abstraction. If the people who dictate what an ISBN "is" in general, and what a specific ISBN "equates" to, ever change the rules, then you are hosed. You are, in effect, tying your database schema to somebody else's schema. Primary keys are cheap.

Now, I am not advocating normalizing your database to absurdity, but best practices in object-oriented programming and frameworks have shown the value of adding reasonable layers of abstraction. What seems set in stone today will seem like warm gelatin when somebody decides to change the rules without your consent.

 
Renegade
« Reply #21 on: October 04, 2007, 07:43:49 PM »

Very interesting discussion here, and on one of my favorite topics - databases! cheesy

I'm going to have to go with the "archive the data" thing. There's no excuse for deleting users from a database. That only means you hired a shitty consulting firm that did an inadequate job of designing the system.

Like above - "Primary keys are cheap." Agreed 100%.

It's not hard to go an extra step in normalization and use a single table as a "key" table that stores only the most essential data for a particular user, e.g. primary key, full name, passport number & type as a foreign key (int), or whatever. Your table is then really lean and easily searched. Getting too big? Hardly possible anymore with the current state of hardware.

Concurrency issues? Just a bad design. A primary key should only ever live in one place, so looking in multiple places just doesn't make sense. If you're spinning things off into another database for archived data, then there's a design issue there that needs to be resolved, but even then there should be no duplicate information.

Ok - I'm a bit radical with database design. I like pushing towards the 5th normal form as much as possible. Data should always be unique where possible.

Ah well - I'll go back to lurking. smiley



Slow Down Music - Where I commit thought crimes...

Freedom is the right to be wrong, not the right to do wrong. - John Diefenbaker
DonationCoder.com Forum | Powered by SMF