Non-Separating Commas in Comma Separated Values
Ill-formatted csv files are no uncommon phenomenon in the wild. This post explores how to deal with a particular problem in python
using regular expressions.
The Problem
The ill-formatted csv files contained values that represented enumerations or even full sentences which, unsurprisingly, themselves contained multiple commas. The files’ creators seemed to have thought of enclosing those fields with double quotes to make clear that the comma is not a separator. They have not, however, thought of the possibility that the enumerations themselves can contain double quotes resulting in a malformed csv file of the form
sample = '''column1,column2,column3
1,"text without quote, but comma",1
2,"text quoting "a sentence, words, and more" in a single cell",2'''
Now, when reading the file with pandas
the parser thinks that the third row consists of five columns instead of three and will throw a ParseError
:1
import io
import pandas as pd
pd.read_csv(io.StringIO(sample)) # this throws ParserError: Error tokenizing data. C error: Expected 3 fields in line 3, saw 5
The Solution
To the human eye it is clear that the sample contains three columns and that the second field in the third row contains nested double quotes rather than three separate fields. The objective is to replace the inner double quotes to enable the parser to identify the string within the outer double quotes as a single field. This is achieved with a regular expression and a dynamic replacement of the string matched by the expression. Keep on reading for an explanation of the solution.
import re
pattern = r'(?P<keep1>^|,"[^"]*)(?P<innerQuote1>")(?!,)(?P<keep2>[^"]+)(?P<innerQuote2>")(?P<keep3>[^"]*",|$)'
def deal_with_inner_quotes(string):
# for example replace double quotes with single quotes
return string.replace('"', "'")
def repl(m):
parts = [
m.group('keep1'),
deal_with_inner_quotes(m.group('innerQuote1')),
m.group('keep2'),
deal_with_inner_quotes(m.group('innerQuote2')),
m.group('keep3'),
]
return ''.join(parts)
sample_fixed = re.sub(pattern, repl, sample)
pd.read_csv(io.StringIO(sample_fixed)) # this works as expected
Explanation
Identifying the Problematic Strings
This is the kind of task that is perfect to be tackled with regular expressions. The Python Standard Library comes with the module re
providing regular expression operations. The first step is to define a regular expression that matches the problematic fields.
The Outer Double Quotes
Let’s start with identifying the fields enclosed by double quotes (and checking for commas or beginning/end of strings using the non-capturing groups (?:...)
):
pattern = '(?:^|,)".*"(?:,|$)'
Applying the pattern
to our sample
, shows that it captures the relevant sub-strings:
import re
pattern = '(?:^|,)"(?!,).*"(?:,|$)'
matches = re.findall(pattern, sample)
print(matches) # a list of two strings: [',"text without quote, but comma",', ',"text quoting "a sentence, words, and more" in a single cell",']
The Inner Double Quotes
We now can identify all fields that are enclosed by double quotes. But we don’t need to actually modify all of them as the parser only gets confused by those that also contain inner double quotes. Let’s therefore modify the inner part of the pattern
to something more restrictive.
- The first outer and first inner double quote can be separated by an arbitrary, possibly empty string that is not a double quote which corresponds to the expression
[^"]*
. - Because the first inner double quote is the start of a new quote it must not be followed by a comma, so we add a negative lookahead
"(?!,)
. - Following the first inner double quote is an arbitrary but non-empty string that must not contain double quotes
[^"]+
.2 - Finally, the second inner double quote can be followed by an arbitrary, possibly empty string without double quotes
"[^"]*
.
Putting it all together yields the pattern
:
pattern = '(?:^|,)"[^"]*"(?!,)[^"]+"[^"]*"(?:,|$)'
Let’s apply it to our sample
:
pattern = '(?:^|,)"[^"]*"(?!,)[^"]+"[^"]*"(?:,|$)'
matches = re.findall(pattern, sample)
print(matches) # a list of a single string: [',"text quoting "a sentence, words, and more" in a single cell",']
Sanitising the Problematic Strings
Once we have identified the problematic string, we still have to sanitise it to make it palatable for the parser. The trick for me was to realise that re.sub
can take a function that takes the re.Match
object captured by pattern
as an argument. This enables a dynamic replacement of capturing groups in the string as opposed to a static replacement by a fixed value.
Let’s first look at what the function has to do. It takes a single argument m
which is a re.Match
object obtained from matching pattern
in sample
. The captured groups in the re.Match
object can be accessed via re.Match.groups
, so we can select the groups that we want to change (the inner double quotes) and the ones we want to keep (everything else).
By default, match groups can be accessed by their integer indices. But to improve readability, let’s update pattern
with named capturing groups using the syntax (?P<name>...)
. The matched groups can then be accessed by their name re.Match.group('<name>')
.
pattern = r'(?P<keep1>^|,"[^"]*)(?P<innerQuote1>")(?!,)(?P<keep2>[^"]+)(?P<innerQuote2>")(?P<keep3>[^"]*",|$)'
We can use the group names to get a captured group from the re.Match
object and apply a custom function deal_with_inner_quotes
to selected groups. This will look something like this:
def deal_with_inner_quotes(string):
# for example replace double quotes with single quotes
return string.replace('"', "'")
def repl(m):
parts = [
m.group('keep1'),
deal_with_inner_quotes(m.group('innerQuote1')),
m.group('keep2'),
deal_with_inner_quotes(m.group('innerQuote2')),
m.group('keep3'),
]
return ''.join(parts)
print(re.sub(pattern, repl, sample)) # 'column1, column2, column3\n1,"text without quote, but comma", 1\n2"text quoting \'a sentence, words, and more\' in a single cell" 2\n'
Footnotes
-
pandas.read_csv
provides an argumenton_bad_lines
which is set toerror
by default. To avoid theParseError
and discard the ill-formed lines the user can simply set it towarn
orskip
. ↩ -
Let’s for simplicity assume that there is no third nested level of double quotes. ↩