Calc: Text Columns Not Quoted When Saving .csv

I've been going around in circles on this issue for most of today without
finding the solution. The context:

   I have several large data files provided in .csv format that need to
be edited and have column formats made explicit.

   These files are recognized by LO/Calc as spreadsheets when I open
them, and I change the field separator from a tab to a comma. They display
correctly in the spreadsheet.

   I explicitly format text, integer, and floating point columns by
highlighting the column(s), selecting Format -> Cells, and choosing the
appropriate type.

   The problem comes when I save the explicitly-formatted file: the text
fields are quoted only when they have content, not when they are blank.
This, of course, prevents correctly reading the files into database tables
or R data frames. I have tried all sorts of kludges to force Calc to display
the .csv dialog box where I can tell it to quote all text fields, but those
cells without content are saved as ,, rather than ,'',.

   What do I do so that blank cells in text-formatted columns are quoted in
the saved (or saved-as) .csv file?

TIA,

Rich

Hi :slight_smile:
My ham-fisted kludge would probably be to save the Csv and then open it in a text-editor, such as Kate, Gedit, Notepad or whatever, and then do search&replace to look for
,,
and replace with
," ",

Perhaps in Calc add an unusual combination of characters in these empty cells.

Actually I have just tried this in Calc and Gnumeric. Text cells didn't get any " or ' marks around them, except cells with just spaces in them and formatted as text (and only in Gnumeric, not in Calc). Then I noticed that when you "Save As ..." there is a tick-box to "Edit filter preferences", and that gives a pop-up dialogue-box with a tick-box to "Quote all text fields". Only then does it start putting " around spaces and letters. Empty cells still have nothing.

Is that something to do with UTF-8 encoding?
Regards from
Tom :slight_smile:

The problem comes when I save the explicitly-formatted file: the text fields are quoted only when they have content, not when they are blank.

I don't think that's true: I think empty cells are output as nothing, but blank cells - those in which the text string is one or more blanks - are output correctly in quotes.

This, of course, prevents correctly reading the files into database tables or R data frames. I have tried all sorts of kludges to force Calc to display the .csv dialog box where I can tell it to quote all text fields, but those cells without content are saved as ,, rather than ,'',. What do I do so that blank cells in text-formatted columns are quoted in the saved (or saved-as) .csv file?

Could you convert your empty cells to blank, output as CSV, read this into your other program, and then - only if necessary - edit the blanks back to null?

This might be more difficult if you have both genuinely empty and blank cells and if these are significantly different; in that case, could you put some other string into the empty cells and edit these out in the destination program?
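In PostgreSQL terms, that last edit might be a single statement per affected column; a sketch, with made-up table and column names:

-- Turn blank (space-only) strings back into NULLs after loading.
UPDATE staging SET notes = NULL WHERE btrim(notes) = '';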

Can your other programs cope only with CSV files? Not spreadsheet document files?

I trust this helps.

Brian Barker

My ham-fisted kludge would probably be to save the Csv and then open it in
a text-editor, such as Kate, Gedit, Notepad or whatever and then do
search&replace to look for

Tom,

   I use emacs for almost all my writing.

,,
and replace with
," ",

   The problem with searching for ,, and replacing it with ,'', is that
blank number cells are also quoted. SQL does not like that. :slight_smile:

Perhaps in Calc add an unusual combination of characters in these empty
cells.

   The first thing I did was save the freshly-opened .csv as .ods. I get
client data as exports from their Excel spreadsheets (sigh) which I save as
.ods, then clean up. When I save the .ods file as .csv the text fields --
even the blank ones -- are quoted. That's not working with these files.

   I'm trying (with C-h) to replace all blank cells with NA. However, with
105,000 rows and about 50 columns it takes a very long time ... I gave up for
today. But I'll try again tomorrow. Then I can select the number columns and
replace the NA with blanks.
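If the NA placeholders survive into the saved .csv, postgres's COPY (or psql's \copy) should be able to map that string straight to NULL at load time, which might spare the second pass; a sketch, with a made-up table and file name:

-- Treat the literal string NA as SQL NULL while loading the CSV.
\copy mytable FROM 'data.csv' WITH (FORMAT csv, NULL 'NA')

(In CSV mode, COPY already treats unquoted empty fields as NULL by default, so the explicit marker matters mainly when a placeholder such as NA is used.)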

Quite the hassle,

Rich

Could you convert your empty cells to blank, output as CSV, read this into
your other program, and then - only if necessary - edit the blanks back to
null?

Brian,

   One does not 'read' the file into a SQL database such as postgres. When a
column attribute type is text, SQL wants to see it quoted; when it's numeric
(INTEGER, REAL), it cannot be quoted. For the record, DATE attributes, even
in ISO format (YYYY-mm-dd), must be quoted for SQL to interpret them properly.
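A minimal illustration of those quoting rules, with made-up names:

CREATE TABLE people (name text, hired date, score integer);

-- Text and date literals take single quotes; numeric literals take none.
INSERT INTO people (name, hired, score) VALUES ('Smith', '2013-12-01', 42);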

Can your other programs cope only with CSV files? Not spreadsheet
document files?

   Nope. SQL wants certain formats, and R might be able to read spreadsheets,
but I work with plain ASCII text files for almost everything.

Rich

Rich,

In a csv file the field delimiter is usually a comma, and quotation marks are used to denote a string that may contain a comma. If the cell in the original format was empty, or NULL in a database, nothing is exported. For example, the export from a database might look like this:

1,"text string, test","the last two were NULLs in the database",34
2,"another string",3,3,57

For the exported value to be an empty string ("") the original value must be an empty string "", not an empty cell. When importing into a database it is often better to have the equivalent of a NULL value, because each column in the database will have an assigned data type and an empty string will cause an import error if there is a data type mismatch. The empty string will cause problems with numeric data.
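In PostgreSQL, for instance, the difference shows up immediately with a numeric column (names here are made up):

CREATE TABLE t (n integer);

INSERT INTO t (n) VALUES ('');    -- ERROR: invalid input syntax for type integer: ""
INSERT INTO t (n) VALUES (NULL);  -- accepted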

Hi :slight_smile:
Errr, have you tried reading the SQL tables directly with LO?  Base is best when used as a front-end for an existing back-end, and I think Postgres is one that Base works particularly well with at the moment.  I think the Base - Postgres
connector was at least partly developed by people from the Postgres project fairly recently (within the last year, I think).

If the connector lets you use the Postgres database then it should be possible to edit the tables in Calc dynamically without having to go through any Csv file import/export processing.

Regards from
Tom :slight_smile:

Tom,

   The source data are .csv files. I need to clean them up and reformat them
for insertion into tables. That is, each row needs to start with INSERT INTO
<tablename> (column_names) VALUES ( and end with );. I put the CREATE TABLE
schema at the top of the file.
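For illustration, the generated file is shaped roughly like this (schema and names invented):

CREATE TABLE samples (site text, sampled date, reading real);

INSERT INTO samples (site, sampled, reading) VALUES ('A-1', '2013-12-01', 3.7);
INSERT INTO samples (site, sampled, reading) VALUES ('B-2', '2013-12-02', NULL);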

Thanks,

Rich

In a csv file the field delimiter is usually a comma, and quotation marks are
used to denote a string that may contain a comma. If the cell in the original
format was empty, or NULL in a database, nothing is exported.
For example, the export from a database might look like this:

1,"text string, test","the last two were NULLs in the database",34
2,"another string",3,3,57

Jay,

   I know this and that's why I don't understand why postgres balks at
inserting the rows with NULL values in text columns.

For the exported value to be an empty string ("") the original value must be an empty string "", not an empty cell. When importing into a database it is often better to have the equivalent of a NULL value, because each column in the database will have an assigned data type and an empty string will cause an import error if there is a data type mismatch. The empty string will cause problems with numeric data.

   That's what I'm doing: replacing empty cells with NA strings. I aborted
the first attempt yesterday afternoon because it was late and the size of
the data file (~20 MB) meant it was taking too long to complete. I'll do this today.

Thanks,

Rich

The reason for the problem, and its solution, finally penetrated my
consciousness, so I resolved the issue.

   Usually, the data I get in spreadsheet format has a value in each cell so
I can write the SQL INSERT INTO statement without specifying column names for
each value; each row is complete. With these data, however, there are
missing values which are different in each row. Therefore, I need to either
specify the attributes with values for each row in the INSERT INTO statement
or indicate blanks with NULLs.

   Because each spreadsheet can have 19-100 columns and up to 105,000 rows,
hand-crafting an INSERT INTO statement for each file would require a huge
amount of time. Ergo, I'm filling in blank cells with NA, which is easily
converted to NULL once the table exists in postgres.
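The conversion itself should be one statement per affected column; roughly, with made-up names:

-- Replace the NA placeholder with a real NULL.
UPDATE samples SET site = NULL WHERE site = 'NA';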

   Thanks for all the suggestions, and the best of the holiday season to
everyone.

Rich

In a csv file the field delimiter is usually a comma, and quotation marks are
used to denote a string that may contain a comma. If the cell in the original
format was empty, or NULL in a database, nothing is exported.
For example, the export from a database might look like this:

1,"text string, test","the last two were NULLs in the database",34
2,"another string",3,3,57

Jay,

  I know this and that's why I don't understand why postgres balks at
inserting the rows with NULL values in text columns.

Check the column restrictions; often the column must be defined to accept NULL. This is true with MS SQL Server and MySQL/MariaDB. If the column definition does not explicitly allow NULL values you must enter an appropriate placeholder value (empty string, 0, etc.).
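In PostgreSQL, for example, a column accepts NULL unless it is declared NOT NULL (names here are made up):

CREATE TABLE example (
    id    integer NOT NULL,  -- inserting NULL here raises an error
    notes text               -- nullable by default; NULL is accepted
);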

Happy holidays also.

The table schema requires each line to have 100 distinct, if empty, columns, and the data fill a variable number of columns. I did not think of that as the problem, but it would be for any relational database.

Jay,

   The column separator, a comma, is present for each column, but those
without content are not automagically assigned NULL when inserting, so the
data appear not to have the proper number of columns. So a simple UPDATE
command will change the NA to NULL in each table.

Rich