Comma-Separated Values

March 17, 2009

The comma-separated values file format is commonly used to exchange tabular information, such as spreadsheets, between different programs and operating systems. Though common, the csv format has never been formally defined, and there are variations. One definition is at http://www.rfc-editor.org/rfc/rfc4180.txt.

We will define a csv file as containing ascii text, with records terminated by a line-termination sequence and fields containing zero or more characters separated by commas; a line-termination sequence is a carriage-return character, or a line-feed character, or both characters in either order. Leading and trailing space in unquoted fields is preserved. Fields may be surrounded by double-quote characters (ascii \042) may contain any characters; such fields may contain newlines, literal commas (ascii \054), and double-quote characters represented as two successive double-quotes. The field separator is normally a comma, but may be changed to an arbitrary character (often a semi-colon) in those European countries that use a comma instead of a decimal point.

The examples below show some of the kinky situations that may arise; for readability, the three-character sequence TAB represents a single tab character (ascii \011):

1,abc,def ghi,jkl,unquoted character strings
2,"abc","def ghi","jkl",quoted character strings
3,123,456,789,numbers
4, abc,def , ghi ,strings with whitespace
5, "abc","def" , "ghi" ,quoted strings with whitespace
6, 123,456 , 789 ,numbers with whitespace
7,TAB123,456TAB,TAB789TAB,numbers with tabs for whitespace
8, -123, +456, 1E3,more numbers with whitespace
9,123 456,123"456, 123 456 ,strange numbers
10,abc",de"f,g"hi,embedded quotes
11,"abc""","de""f","g""hi",quoted embedded quotes
12,"","" "",""x"",doubled quotes
13,"abc"def,abc"def","abc" "def",strange quotes
14,,"", ,empty fields
15,abc,"def
ghi",jkl,embedded newline
16,abc,"def",789,multiple types of fields

Each of those records has five fields, shown below with the vertical bar character as a field separator:

1|abc|def ghi|jkl|unquoted character strings
2|abc|def ghi|jkl|quoted character strings
3|123|456|789|numbers
4| abc|def | ghi |strings with whitespace
5| "abc"|def | "ghi" |quoted strings with whitespace
6| 123|456 | 789 |numbers with whitespace
7|TAB123|456TAB|TAB789TAB|numbers with tabs for whitespace
8| -123| +456| 1E3|more numbers with whitespace
9|123 456|123"456| 123 456 |strange numbers
10|abc"|de"f|g"hi|embedded quotes
11|abc"|de"f|g"hi|quoted embedded quotes
12|| ""|x""|doubled quotes
13|abcdef|abc"def"|abc "def"|strange quotes
14||| |empty fields
15|abc|def
ghi|jkl|embedded newline
16|abc|def|789|multiple types of fields

Write a function that retrieves records from a csv file. Be sure to provide a way to handle user-specified field-separators.

Pages: 1 2

6 Responses to “Comma-Separated Values”

  1. Connochaetes said

    A quick and dirty Ruby hack:

    def separated_vals(filename, separator = ‘,’)
    lines = Array.new
    sep = Regexp.compile(“\\s*\\”+separator+”\\s*”)

    File.open(filename) do |file_handle|
    lines << file_handle.gets.split(sep)
    end
    return lines
    end

  2. programmingpraxis said

    I don’t think that properly handles all the cases.

  3. Connochaetes said

    You’re quite right; it doesn’t quite work.

    Here’s a better version:

    def separated_vals(filename, separator = ‘,’)
    lines = Array.new

    File.open(filename) do |file_handle|
    while not file_handle.eof? do
    lines << file_handle.gets.strip.split(separator)
    end
    end
    return lines
    end

    This works for all test cases with one omission – it keeps quotes around quoted strings. Now, instead of hacking further, I think I’ll sit down and write a proper state machine :)

  4. FalconNL said

    Haskell (parses every test case correctly):

    import Control.Applicative ((<*), (*>), (<*>), (<$>))
    import Text.Parsec

    main = readFile “csv.txt” >>= print . parseCSV ‘,’

    parseCSV sep = parse (sepBy (line sep) (many1 $ oneOf “\r\n”)) “”
    line sep = sepBy (field sep) (char sep)
    field sep = option “” dropQuotes <+> (ws <+> (keepQuotes <|> plain sep)) <+> ws
    plain sep = many (noneOf $ sep:”\r\n”)
    dropQuotes = try (between’ “\”\”” quoted) <|> between’ “\”” quoted
    keepQuotes = (string “\”” <+> quoted <+> string “\””)
    quoted = many (noneOf “\”” <|> (head <$> try (string “\”\””)))
    ws = many (oneOf ” \t”)
    a <+> b = (++) <$> a <*> b
    between’ s p = string s *> p <* string s [/sourcecode]

  5. […] Reddit or Stack Overflow or someplace about handling CSV files with awk. We’ve done that in a previous exercise, but today I decided to handle CSV files in a different way. Specifically, I wrote an awk function […]

Leave a comment