Comma-Separated Values

March 17, 2009

The comma-separated values file format is commonly used to exchange tabular information, such as spreadsheets, between different programs and operating systems. Though common, the csv format has never been formally defined, and there are variations. One definition is at http://www.rfc-editor.org/rfc/rfc4180.txt.

We will define a csv file as containing ascii text, with records terminated by a line-termination sequence and fields containing zero or more characters separated by commas; a line-termination sequence is a carriage-return character, or a line-feed character, or both characters in either order. Leading and trailing space in unquoted fields is preserved. Fields may be surrounded by double-quote characters (ascii \042) may contain any characters; such fields may contain newlines, literal commas (ascii \054), and double-quote characters represented as two successive double-quotes. The field separator is normally a comma, but may be changed to an arbitrary character (often a semi-colon) in those European countries that use a comma instead of a decimal point.

The examples below show some of the kinky situations that may arise; for readability, the three-character sequence TAB represents a single tab character (ascii \011):

1,abc,def ghi,jkl,unquoted character strings 2,"abc","def ghi","jkl",quoted character strings 3,123,456,789,numbers 4, abc,def , ghi ,strings with whitespace 5, "abc","def" , "ghi" ,quoted strings with whitespace 6, 123,456 , 789 ,numbers with whitespace 7,TAB123,456TAB,TAB789TAB,numbers with tabs for whitespace 8, -123, +456, 1E3,more numbers with whitespace 9,123 456,123"456, 123 456 ,strange numbers 10,abc",de"f,g"hi,embedded quotes 11,"abc""","de""f","g""hi",quoted embedded quotes 12,"","" "",""x"",doubled quotes 13,"abc"def,abc"def","abc" "def",strange quotes 14,,"", ,empty fields 15,abc,"def ghi",jkl,embedded newline 16,abc,"def",789,multiple types of fields

Each of those records has five fields, shown below with the vertical bar character as a field separator:

1|abc|def ghi|jkl|unquoted character strings 2|abc|def ghi|jkl|quoted character strings 3|123|456|789|numbers 4| abc|def | ghi |strings with whitespace 5| "abc"|def | "ghi" |quoted strings with whitespace 6| 123|456 | 789 |numbers with whitespace 7|TAB123|456TAB|TAB789TAB|numbers with tabs for whitespace 8| -123| +456| 1E3|more numbers with whitespace 9|123 456|123"456| 123 456 |strange numbers 10|abc"|de"f|g"hi|embedded quotes 11|abc"|de"f|g"hi|quoted embedded quotes 12|| ""|x""|doubled quotes 13|abcdef|abc"def"|abc "def"|strange quotes 14||| |empty fields 15|abc|def ghi|jkl|embedded newline 16|abc|def|789|multiple types of fields

Write a function that retrieves records from a csv file. Be sure to provide a way to handle user-specified field-separators.

Posted by programmingpraxis

Filed in Exercises

6 Comments »

6 Responses to “Comma-Separated Values”

Connochaetes said
March 18, 2009 at 9:54 AM
A quick and dirty Ruby hack:

def separated_vals(filename, separator = ‘,’)
lines = Array.new
sep = Regexp.compile(“\\s*\\”+separator+”\\s*”)

File.open(filename) do |file_handle|
lines << file_handle.gets.split(sep)
end
return lines
end
programmingpraxis said
March 18, 2009 at 11:38 AM
I don’t think that properly handles all the cases.
Connochaetes said
March 18, 2009 at 2:32 PM
You’re quite right; it doesn’t quite work.

Here’s a better version:

def separated_vals(filename, separator = ‘,’)
lines = Array.new

File.open(filename) do |file_handle|
while not file_handle.eof? do
lines << file_handle.gets.strip.split(separator)
end
end
return lines
end

This works for all test cases with one omission – it keeps quotes around quoted strings. Now, instead of hacking further, I think I’ll sit down and write a proper state machine :)
FalconNL said
March 18, 2009 at 11:47 PM
Haskell (parses every test case correctly):

import Control.Applicative ((<*), (*>), (<*>), (<$>))
import Text.Parsec

main = readFile “csv.txt” >>= print . parseCSV ‘,’

parseCSV sep = parse (sepBy (line sep) (many1 $ oneOf “\r\n”)) “”
line sep = sepBy (field sep) (char sep)
field sep = option “” dropQuotes <+> (ws <+> (keepQuotes <|> plain sep)) <+> ws
plain sep = many (noneOf $ sep:”\r\n”)
dropQuotes = try (between’ “\”\”” quoted) <|> between’ “\”” quoted
keepQuotes = (string “\”” <+> quoted <+> string “\””)
quoted = many (noneOf “\”” <|> (head <$> try (string “\”\””)))
ws = many (oneOf ” \t”)
a <+> b = (++) <$> a <*> b
between’ s p = string s *> p <* string s [/sourcecode]
ftt said
February 10, 2014 at 9:50 PM
https://github.com/ftt/programming-praxis/blob/master/20090317-comma-separated-values/csv.py
Programming Praxis said
July 30, 2019 at 9:01 AM
[…] Reddit or Stack Overflow or someplace about handling CSV files with awk. We’ve done that in a previous exercise, but today I decided to handle CSV files in a different way. Specifically, I wrote an awk function […]

Programming Praxis