Programming 101 Lesson: Don't Purge User Data

Grorgy:
I worked for a telecommunications company where data preservation and the ability to find the data again were of great importance. It could not be left on the day-to-day databases, so separate archive databases with the relevant information were set up. Depending on the age of the data you were after, it could be accessed almost immediately or at worst overnight. Prior to computerization it was all done with microfiche: clumsy, a storage problem of its own, and not hugely secure, but it worked, mostly.

Product information would be another matter but where the customer is still alive and you are dealing with their money it seems incumbent upon the people with the control of these resources to have the data secure.

mouser:
OGroeger, in the past it may have been true that disk storage was expensive and slow enough that deleting data was something that people worried about.

It seems to me that storage should really no longer be an issue, and that deleting data to save space should be considered harmful. I'm not talking about keeping terabytes of data on each user, and i'm not talking about refusing to delete sensitive security information that should be purged.

But I am saying that if you are writing a software system, and think to yourself "we can save 10% of the database size if we purge old accounts" then take a deep breath and don't do it.

In fact, I'd venture a rule of thumb that would go something like this: until you are talking about growing the database to 1000% of its normal size, i would not even consider purging items from the database. And at that point you would do well to simply *move* such inactive accounts to a secondary database, and not eliminate the data.
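The "move, don't eliminate" idea can be sketched roughly as follows. This is only an illustration, assuming SQLite with a second attached database standing in for the secondary database; the `accounts` table, its columns, and the `archive_inactive` helper are hypothetical names, not anything from a real system.

```python
import sqlite3


def archive_inactive(conn, cutoff):
    """Move accounts last seen before `cutoff` into archive.accounts.

    Returns the number of rows moved. Nothing is destroyed: every row
    removed from the live table exists in the archive table afterwards.
    """
    with conn:  # one transaction: the copy and the delete are committed together
        conn.execute(
            "INSERT INTO archive.accounts "
            "SELECT * FROM main.accounts WHERE last_seen < ?",
            (cutoff,),
        )
        cur = conn.execute(
            "DELETE FROM main.accounts WHERE last_seen < ?", (cutoff,)
        )
    return cur.rowcount


# Hypothetical setup: an in-memory live database plus an attached archive.
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS archive")
schema = "(id INTEGER PRIMARY KEY, name TEXT, last_seen TEXT)"
conn.execute(f"CREATE TABLE main.accounts {schema}")
conn.execute(f"CREATE TABLE archive.accounts {schema}")
conn.executemany(
    "INSERT INTO main.accounts (name, last_seen) VALUES (?, ?)",
    [("alice", "2007-06-01"), ("bob", "2003-02-14")],
)

moved = archive_inactive(conn, "2005-01-01")
live = conn.execute("SELECT name FROM main.accounts").fetchall()
archived = conn.execute("SELECT name FROM archive.accounts").fetchall()
```

The live table shrinks, which addresses the size concern, while the old account can still be looked up (or moved back) from the archive.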

Perhaps another way to view what I am trying to emphasize from a programming perspective is this:
As programmers when we are writing a system (for example a forum system), we often fail to plan for the idea of *disabled* items.

So imagine you are designing a forum system, which creates user accounts, posts, sections, etc. There will be times when you need to eliminate such user accounts, posts, sections, etc. Now ask yourself -- are these actions reversible, preserving all old information? Do you have a way to disable instead of purge these items? Are all your functions coded in such a way that they distinguish between "disabled" items and normal items? Or is your code set up such that you have to actually permanently purge such accounts to disable them?
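A minimal sketch of what "disable instead of purge" might look like for user accounts, assuming SQLite and a nullable `disabled_at` column. The schema and the function names are made up for illustration; a real forum system would need the same treatment for posts, sections, and so on.

```python
import sqlite3
from datetime import datetime, timezone


def make_db():
    conn = sqlite3.connect(":memory:")
    # disabled_at is NULL for normal accounts, a timestamp for disabled ones.
    conn.execute(
        "CREATE TABLE users ("
        "id INTEGER PRIMARY KEY, name TEXT NOT NULL, disabled_at TEXT)"
    )
    return conn


def add_user(conn, name):
    return conn.execute("INSERT INTO users (name) VALUES (?)", (name,)).lastrowid


def disable_user(conn, user_id):
    """Mark the account as disabled; the row and all its data stay in place."""
    stamp = datetime.now(timezone.utc).isoformat()
    conn.execute(
        "UPDATE users SET disabled_at = ? WHERE id = ? AND disabled_at IS NULL",
        (stamp, user_id),
    )


def restore_user(conn, user_id):
    """Reverse a disable: nothing was lost, so this is a one-column update."""
    conn.execute("UPDATE users SET disabled_at = NULL WHERE id = ?", (user_id,))


def active_users(conn):
    """Every normal query has to filter on disabled_at explicitly."""
    rows = conn.execute(
        "SELECT name FROM users WHERE disabled_at IS NULL ORDER BY id"
    )
    return [name for (name,) in rows]


conn = make_db()
alice = add_user(conn, "alice")
bob = add_user(conn, "bob")

disable_user(conn, bob)
after_disable = active_users(conn)

restore_user(conn, bob)
after_restore = active_users(conn)
total_rows = conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]
```

The cost of this design is the one raised in the questions above: every function that reads accounts must remember to distinguish disabled rows from normal ones.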

Some "defensive" programming practices have become more commonplace in modern programming, such as test case generation and source control. But attention to auditing/logging [1] is much less common. I'd like to suggest this concept is an important one as well, which might be called: "Disabling over Deleting".

[1] recently i discovered i had made a change that misconfigured the email address that requests for the silly cody kits were being sent to. it was only after a month had passed that i figured out the form submissions were never being sent. i was able to recover all of the submissions because i had created in the script a backup procedure that always logged form submissions into a simple text file at time of submission -- not as convenient to parse, but as a redundant backup system it worked perfectly.
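A rough sketch of that kind of redundant logging, under some assumptions: the original script is not shown, so the function names, the one-JSON-object-per-line file format, and the `send` callback below are all hypothetical stand-ins. The point is only the ordering: write the submission to a local file first, then attempt delivery.

```python
import json
import os
import tempfile
from datetime import datetime, timezone


def log_submission(log_path, fields):
    """Append one submission to a plain text file, one JSON object per line."""
    record = {
        "received_at": datetime.now(timezone.utc).isoformat(),
        "fields": fields,
    }
    with open(log_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")


def handle_submission(log_path, fields, send):
    """Log first, then try to deliver; return True if delivery succeeded."""
    log_submission(log_path, fields)  # the backup exists even if sending fails
    try:
        send(fields)
        return True
    except Exception:
        return False


def read_submissions(log_path):
    """Recover the logged form fields, e.g. after a delivery problem."""
    with open(log_path, encoding="utf-8") as f:
        return [json.loads(line)["fields"] for line in f if line.strip()]


def broken_send(fields):
    # Stand-in for a mail step that is failing (hypothetical).
    raise ConnectionError("delivery failed")


log_path = os.path.join(tempfile.mkdtemp(), "submissions.log")
delivered = handle_submission(log_path, {"name": "Ann", "request": "kit"}, broken_send)
recovered = read_submissions(log_path)
```

Note that a misconfigured address may not raise any error at all, which is exactly why the log is written unconditionally rather than only on failure.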

OGroeger:
Mouser, i think we are still talking about huge systems (the user data of a credit card system), and believe me, your arguments are wrong on this scale. These systems are planned for project durations of several years, often ten. Designing a system that just keeps growing for ten years will definitely not work, and not because of the cost of disk storage. Hardware costs are negligible in these projects. Let's say you need a good server for $50,000. This is nothing compared to the other costs. The main reason this doesn't work is time. Because your database is constantly growing, its performance will constantly degrade. The query which took 0.1 ms in the beginning now takes 1 ms. The mouse click in the client which took 1 sec in the beginning now takes 30 sec. At this point the big manager will call you and present the following calculation: in the beginning, 500 people did a particular amount of work. 500 people cost $17.5 million in salary per year. Now they can finish only half of the work because the application slows down day by day. So the same amount of work now costs $35 million! I assure you, this manager won't be grateful that you offered him the possibility of resurrecting the data of a user who died 5 years ago.
Other problems follow, e.g. the time for making backups grows from 5 minutes to 2 hours. This can be problematic when the company wants to make cold database backups (the database must be shut down), because the downtime is lost time....

Please don't misunderstand me. I'm on your side regarding 99% of the applications in this world. It's great to have undo and redo, and it's ok to disable a *reasonable* amount of data, but you can't make this a global programming rule.

If i couldn't convince you, i suggest one experiment: let's simulate "Disabling over Deleting". We don't remove files and directories anymore, but move them to the recycle bin for one year, and then discuss the results in Oct 2008.

f0dder:
Pft, as if a credit card customer database needs to be very large per customer... bullshit. And do databases get consistently slower when client info is added? :-\

Target:
seems to me there are 2 parts to this...

1. the requirement for data retention. The amount and type of personal info that could/should/would be retained would depend on the data and the user/use, however I can't think of a single instance where someone might not make a reasonable claim for that data (under FOI, audit requirements, legal actions, etc)

2. the establishment and maintenance of a sensible data management process, ie live storage limits, archive timetables, etc.

in line with 1 above, it would seem appropriate for any business to retain all records of 'recent' transactions in a live database. Most (big?) businesses analyse this information as a marketing resource if nothing else. Obviously there are limits to the currency of any information, which is where 2 comes in, however there should always be the option to recover data which is deemed no longer current.

surely developers of a database of this type should be conscious of and make allowances for these requirements - this is a very sweeping generalisation but it's just good practice

the amusing thing with mouser's situation is that these institutions will aggressively seek your business via mailouts, telemarketing, advertising, etc, however they don't appear to value your business enough to retain either your business or your records....

another issue here (and slightly off topic I suspect) is what happens to his credit rating? if this had been his only credit record (and a very clean one it would seem) and there is no longer any record of it, does this not place him at a disadvantage?
