Home | Blog | Software | Reviews and Features | Forum | Help | Donate | About us
topbanner_forum
  *

avatar image

Welcome, Guest. Please login or register.
Did you miss your activation email?

Login with username, password and session length
  • December 14, 2017, 04:09 AM
  • Proudly celebrating 10 years online.
  • Donate now to become a lifetime supporting member of the site and get a non-expiring license key for all of our programs.
  • donate

Author Topic: IDEA Data extraction and convert to CSV  (Read 894 times)

magician62

  • Supporting Member
  • Joined in 2011
  • **
  • Posts: 128
    • View Profile
    • Donate to Member
IDEA Data extraction and convert to CSV
« on: November 26, 2017, 10:30 AM »
I have a number of large files which contain multiple records of two columns of data. The first column is effectively a field name, the second the data. There are multiple records in the file so am looking to extract the data to a CSV with field names as the first row.

The delimiter will need to be the @ symbol as the data may contain a comma. I can format the data to have a delimiter after fieldname and after data if it helps.

Original data source is a word doc, which will be exported to txt for processing. Final destination will be Access

To make it a little harder, not all fieldnames exist in every record, but the first one will always be there.

Ideally the snack will allow the user to define the word in the first column which is the start of a record. Not all rows are really needed, so either they can all be captured and later processed to remove un-needed fields and data. Or if a facility for the user to define the fields existed, I can enter the fieldnames as needed for each document type.

Data example

fieldname1@data1
fieldname2@data2
fieldname3@data3
fieldname4@data4
fieldname1@data1
fieldname2@data2
fieldname3@data3
fieldname4@data4
fieldname1@data1
fieldname3@data3
fieldname4@data4


Alternatively I can have each record in a seperate text file, if that would be easier

Hopefully someone can help, as there are several thousand records in the documents
« Last Edit: November 26, 2017, 10:47 AM by magician62 »

Ath

  • Supporting Member
  • Joined in 2006
  • **
  • Posts: 3,080
    • View Profile
    • Donate to Member
Re: IDEA Data extraction and convert to CSV
« Reply #1 on: November 26, 2017, 11:54 AM »
A question (and more may follow)
- Is the data-part limited to a single line, or can it be multiple lines?

magician62

  • Supporting Member
  • Joined in 2011
  • **
  • Posts: 128
    • View Profile
    • Donate to Member
Re: IDEA Data extraction and convert to CSV
« Reply #2 on: November 26, 2017, 12:07 PM »
All data single line, and would hazard no more than 200 characters.

wraith808

  • Supporting Member
  • Joined in 2006
  • **
  • default avatar
  • Posts: 9,043
    • View Profile
    • Donate to Member
Re: IDEA Data extraction and convert to CSV
« Reply #3 on: November 26, 2017, 01:04 PM »
So taking your data example, what would the output ideally be?

And you say you have Word and Access- do you have Excel?

magician62

  • Supporting Member
  • Joined in 2011
  • **
  • Posts: 128
    • View Profile
    • Donate to Member
Re: IDEA Data extraction and convert to CSV
« Reply #4 on: November 26, 2017, 01:25 PM »
I do have Excel, and I know I can transpose rows/columns, but as far as I know it may not be possible for this case. I am using Office 2007 just in case new features have appeared since

Output format

fieldname1 fieldname2 fieldname3
data1         data2         data3
data1                          data3
data1         data2         data3

Using the @ symbol for delimitation


Ath

  • Supporting Member
  • Joined in 2006
  • **
  • Posts: 3,080
    • View Profile
    • Donate to Member
Re: IDEA Data extraction and convert to CSV
« Reply #5 on: November 26, 2017, 01:26 PM »
I've come to this, so far:
Input:
fieldname1@data1
fieldname2@data2
fieldname3@data3
fieldname4@data4
fieldname1@data1
fieldname2@data2
fieldname3@data3
fieldname4@data4
fieldname1@data1
fieldname3@data3
fieldname4@data4
Output:
fieldname1,fieldname2,fieldname3,fieldname4
data1,data2,data3,data4
data1,data2,data3,data4
data1,,data3,data4
And after adding a few commas in the field3 data:
fieldname1,fieldname2,fieldname3,fieldname4
data1,data2,"data,,3",data4
data1,data2,"data,3",data4
data1,,"data,,,3",data4

Missing columndata sample:
fieldname1@data1
fieldname2@data2
fieldname4@data4
fieldname1@data1
fieldname3@data3
fieldname4@data4
fieldname1@data1
fieldname3@data3
With this output: (attention: columns are in the order found in the input!)
fieldname1,fieldname2,fieldname4,fieldname3
data1,data2,data4,
data1,,data4,data3
data1,,,data3

magician62

  • Supporting Member
  • Joined in 2011
  • **
  • Posts: 128
    • View Profile
    • Donate to Member
Re: IDEA Data extraction and convert to CSV
« Reply #6 on: November 26, 2017, 02:42 PM »
That looks like it will work. The order of result columns is not important, as the columns can be re-ordered or handled in a query later.

Ath

  • Supporting Member
  • Joined in 2006
  • **
  • Posts: 3,080
    • View Profile
    • Donate to Member
Re: IDEA Data extraction and convert to CSV
« Reply #7 on: November 26, 2017, 03:17 PM »
So, I've written this in Java, and running it requires a Java runtime environment to be installed on your computer (Windows or Linux).

Doc2CSV
Version: 1.1.0.0
Released: 2017-11-28

Purpose
Convert a 'column@data' formatted file to a .csv file.

First column name found triggers a 'next record'.
Columns are in the order they are found in the file, so adding a complete first record gives you control over column ordering.
Empty lines, or lines without the @ separator are ignored.
Whats new:
Separator and delimiter can be changed using command-line parameters.
Outputfile can be specified from the command-line.

Installation
  • Download the zip file attached to this post
  • Download and install an Oracle Java JRE (I've been testing with Java 8, but Java 9 should work AFAICS)
  • Unzip the contents of the zipfile to it's own directory (keeping the directory structure intact, on upgrade: overwrite all files)
  • On Linux: rename the start-run.sh script to start-run and make it executable

Running the conversion
  • Open a command-prompt in the application directory
  • Export the .doc file as text
  • execute: start-run {exported-text-file} (a full path/filename can be used, see below for optional parameters)
  • Wait for processing to complete (should be a few seconds)
  • Check the output in the created file (same as the inputfile but with .csv appended, any previous outputfile will be overwritten)

Command-line parameters
  • -s {separator character} : specifies the separator character used in the input file
  • -d {delimiter character} : specifies the delimiter character used in the output file
  • {input filename} : first non option (-s or -d) specified, required!
  • {output filename} second non option (-s or -d) specified
NB: The separator or delimiter character support specifying \t for using the Tab character.
NB: If {output filename} is not specified or equal to {input filename} (name compare only!), then {input filename}.csv is used.
NB: The order of parameters is not important, except for the {input filename} and {output filename}.

Possible enhancements
  • Make the column separator (,) configurable and/or a parameter
  • Make the input-data separator (@) configurable and/or a parameter
  • Do you need a GUI? It can be added but will take some time
  • Process .doc files directly (rather easy, but because of response time not done yet)
« Last Edit: November 28, 2017, 02:24 PM by Ath, Reason: Updated to 1.1.0.0 »

magician62

  • Supporting Member
  • Joined in 2011
  • **
  • Posts: 128
    • View Profile
    • Donate to Member
Re: IDEA Data extraction and convert to CSV
« Reply #8 on: November 26, 2017, 03:46 PM »
Many thanks, have downloaded. As it is late here at the moment, and I need to be up early, I will likely test later tomorrow.

magician62

  • Supporting Member
  • Joined in 2011
  • **
  • Posts: 128
    • View Profile
    • Donate to Member
Re: IDEA Data extraction and convert to CSV
« Reply #9 on: November 26, 2017, 04:32 PM »
So decided to have a little play, few issues, but that was my source data! :)

No need for process from Doc.

GUI would I assume make frequent use easier, and I can see the use growing over time.
Input and column separator options would also be good.

For the moment I have placed the text source in the same folder as the util, which only takes a few seconds.

My test used 9 records, 3 of which didn't process correctly due to issues. One was found to be a duplicate. A bonus!

Target

  • Honorary Member
  • Joined in 2006
  • **
  • Posts: 1,650
    • View Profile
    • Donate to Member
Re: IDEA Data extraction and convert to CSV
« Reply #10 on: November 26, 2017, 11:46 PM »
should be able to do this (import a txt file, and parse to a table.query) directly using access

Ath

  • Supporting Member
  • Joined in 2006
  • **
  • Posts: 3,080
    • View Profile
    • Donate to Member
Re: IDEA Data extraction and convert to CSV
« Reply #11 on: November 27, 2017, 12:54 PM »
should be able to do this (import a txt file, and parse to a table.query) directly using access
Well, I haven't installed MS Access on any computer near me for many years, so that wasn't a viable option. I was almost going to ask if it wasn't supposed to read MS SQL instead of MS Access...
And being a Java/C# style programmer, VBA feels really awkward to me (though I've done small VB/VBA projects in the more distant past).

wraith808

  • Supporting Member
  • Joined in 2006
  • **
  • default avatar
  • Posts: 9,043
    • View Profile
    • Donate to Member
Re: IDEA Data extraction and convert to CSV
« Reply #12 on: November 27, 2017, 01:17 PM »
should be able to do this (import a txt file, and parse to a table.query) directly using access
Well, I haven't installed MS Access on any computer near me for many years, so that wasn't a viable option. I was almost going to ask if it wasn't supposed to read MS SQL instead of MS Access...
And being a Java/C# style programmer, VBA feels really awkward to me (though I've done small VB/VBA projects in the more distant past).

I think he meant the OP could do it, just as until you posted, I was going to do it in Excel (didn't think about directly doing it in Access)

Target

  • Honorary Member
  • Joined in 2006
  • **
  • Posts: 1,650
    • View Profile
    • Donate to Member
Re: IDEA Data extraction and convert to CSV
« Reply #13 on: November 27, 2017, 03:17 PM »

I think he meant the OP could do it, just as until you posted, I was going to do it in Excel (didn't think about directly doing it in Access)

yup, thats where I was going.

I haven't used access for years, but I have done this before so I know its doable (could be he could do it directly from word too given how well integrated office is...)

magician62

  • Supporting Member
  • Joined in 2011
  • **
  • Posts: 128
    • View Profile
    • Donate to Member
Re: IDEA Data extraction and convert to CSV
« Reply #14 on: November 28, 2017, 06:58 AM »
Oh! I missed all the fun :) Totally lost a day there. If anyone finds it, please let me know.

As to direct import in Access, for myself quite sometime since I imported data not in CSV format, and never in the above format. If I say I remember Access 2.... :)

wraith808

  • Supporting Member
  • Joined in 2006
  • **
  • default avatar
  • Posts: 9,043
    • View Profile
    • Donate to Member
Re: IDEA Data extraction and convert to CSV
« Reply #15 on: November 28, 2017, 08:06 AM »
If the solution provided is good enough for you, just file it away for next time.  :Thmbsup:

magician62

  • Supporting Member
  • Joined in 2011
  • **
  • Posts: 128
    • View Profile
    • Donate to Member
Re: IDEA Data extraction and convert to CSV
« Reply #16 on: November 28, 2017, 10:13 AM »
The original solution works. And that is what is needed at the end of the day. Hope to put it to a bigger test later on today. :)

Ath

  • Supporting Member
  • Joined in 2006
  • **
  • Posts: 3,080
    • View Profile
    • Donate to Member
Re: IDEA Data extraction and convert to CSV
« Reply #17 on: November 28, 2017, 02:26 PM »
I've updated Doc2CSV to version 1.1.0.0, adding command-line parameters for separator, delimiter and output filename, defaults same as v1.0.0.0.
Download above from the original release-post, including updated usage info.

magician62

  • Supporting Member
  • Joined in 2011
  • **
  • Posts: 128
    • View Profile
    • Donate to Member
Re: IDEA Data extraction and convert to CSV
« Reply #18 on: November 29, 2017, 02:02 AM »
Many thanks. I have downloaded. Will check when we get back, assuming we don't have snow after we go out :)