# frozen_string_literal: true
# = csv.rb -- CSV Reading and Writing
# Created by James Edward Gray II on 2005-10-31.
# See CSV for documentation.
# == Description
# Welcome to the new and improved CSV.
# This version of the CSV library began its life as FasterCSV. FasterCSV was
# intended as a replacement for Ruby's then-standard CSV library. It was
# designed to address concerns users of that library had and it had three
# primary goals:
# 1. Be significantly faster than CSV while remaining a pure Ruby library.
# 2. Use a smaller and easier to maintain code base. (FasterCSV eventually
# grew larger, but was also considerably richer in features. The parsing
# core remains quite small.)
# 3. Improve on the CSV interface.
# Obviously, the last one is subjective. I did try to defer to the original
# interface whenever I didn't have a compelling reason to change it though, so
# hopefully this won't be too radically different.
# We must have met our goals because FasterCSV was renamed to CSV and replaced
# the original library as of Ruby 1.9. If you are migrating code from 1.8 or
# earlier, you may have to change your code to comply with the new interface.
# == What's Different From the Old CSV?
# I'm sure I'll miss something, but I'll try to mention most of the major
# differences I am aware of, to help others quickly get up to speed:
# * This parser is m17n aware. See CSV for full details.
# * This library has a stricter parser and will throw MalformedCSVErrors on
#   problematic data.
# * This library has a less liberal idea of a line ending than CSV. What you
# set as the <tt>:row_sep</tt> is law. It can auto-detect your line endings though.
# * The old library returned empty lines as <tt>[nil]</tt>. This library calls
#   them <tt>[]</tt>.
# * This library has a much faster parser.
# * CSV now uses Hash-style parameters to set options.
# * CSV no longer has generate_row() or parse_row().
# * The old CSV's Reader and Writer classes have been dropped.
# * CSV::open() is now more like Ruby's open().
# * CSV objects now support most standard IO methods.
# * CSV now has a new() method used to wrap objects like String and IO for
#   reading and writing.
# * CSV::generate() is different from the old method.
# * CSV no longer supports partial reads. It works line-by-line.
# * CSV no longer allows the instance methods to override the separators for
# performance reasons. They must be set in the constructor.
# If you use this library and find yourself missing any functionality I have
# trimmed, please {let me know}[mailto:james@grayproductions.net].
# == What is CSV, really?
# CSV maintains a pretty strict definition of CSV taken directly from
# {the RFC}[http://www.ietf.org/rfc/rfc4180.txt]. I relax the rules in only one
# place and that is to make using this library easier. CSV will parse all valid
# CSV.
# What you don't want to do is to feed CSV invalid data. Because of the way the
# CSV format works, it's common for a parser to need to read until the end of
# the file to be sure a field is invalid. This consumes a lot of time and memory.
# Luckily, when working with invalid CSV, Ruby's built-in methods will almost
# always be superior in every way. For example, parsing non-quoted fields is as
# easy as:
#   data.split(",")
# == Questions and/or Comments
# Feel free to email {James Edward Gray II}[mailto:james@grayproductions.net]
# with any questions.
require_relative "csv/fields_converter"
require_relative "csv/match_p"
require_relative "csv/parser"
require_relative "csv/row"
require_relative "csv/table"
require_relative "csv/writer"
using CSV::MatchP if CSV.const_defined?(:MatchP)
# This class provides a complete interface to CSV files and data. It offers
# tools to enable you to read and write to and from Strings or IO objects, as
# needed.
# The most generic interface of the library is:
# csv = CSV.new(string_or_io, **options)
# # Reading: IO object should be open for read
# csv.read # => array of rows
# # Writing: IO object should be open for write
# csv << row
# There are several specialized class methods for one-statement reading or writing,
# described in the Specialized Methods section.
# If a String is passed into ::new, it is internally wrapped into a StringIO object.
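# For example (a quick sketch; the literal String below stands in for any source):
#   CSV.new("foo,0\nbar,1\n").read # => [["foo", "0"], ["bar", "1"]]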
# +options+ can be used for specifying the particular CSV flavor (column
# separators, row separators, value quoting and so on), and for data conversion,
# see the Data Conversion section for the description of the latter.
# # From a file: all at once
# arr_of_rows = CSV.read("path/to/file.csv", **options)
# CSV.foreach("path/to/file.csv", **options) do |row|
#   # use row here...
# end
# # From a string
# arr_of_rows = CSV.parse("CSV,data,String", **options)
# # or
# CSV.parse("CSV,data,String", **options) do |row|
#   # use row here...
# end
# # To a file
# CSV.open("path/to/file.csv", "wb") do |csv|
#   csv << ["row", "of", "CSV", "data"]
#   csv << ["another", "row"]
# end
# # To a string
# csv_string = CSV.generate do |csv|
#   csv << ["row", "of", "CSV", "data"]
#   csv << ["another", "row"]
# end
# # Core extensions for converting one line
# csv_string = ["CSV", "data"].to_csv # to CSV
# csv_array = "CSV,String".parse_csv # from CSV
# CSV { |csv_out| csv_out << %w{my data here} } # to $stdout
# CSV(csv = "") { |csv_str| csv_str << %w{my data here} } # to a String
# CSV($stderr) { |csv_err| csv_err << %w{my data here} } # to $stderr
# CSV($stdin) { |csv_in| csv_in.each { |row| p row } } # from $stdin
# == Data Conversion
# CSV allows you to specify the column names of a CSV file, whether they are in
# the data or provided separately. If headers are specified, reading methods
# return an instance of CSV::Table, consisting of CSV::Row objects.
# # Headers are part of data
# data = CSV.parse(<<~ROWS, headers: true)
#   Name,Department,Salary
#   Bob,Engineering,1000
# ROWS
# data.class #=> CSV::Table
# data.first #=> #<CSV::Row "Name":"Bob" "Department":"Engineering" "Salary":"1000">
# data.first.to_h #=> {"Name"=>"Bob", "Department"=>"Engineering", "Salary"=>"1000"}
# # Headers provided by developer
# data = CSV.parse('Bob,Engineering,1000', headers: %i[name department salary])
# data.first #=> #<CSV::Row name:"Bob" department:"Engineering" salary:"1000">
# CSV allows you to provide a set of data _converters_, i.e. transformations to
# try on input data. A converter can be a symbol from the keys of the
# CSV::Converters constant, or a lambda.
# # Without any converters:
# CSV.parse('Bob,2018-03-01,100')
# #=> [["Bob", "2018-03-01", "100"]]
# # With built-in converters:
# CSV.parse('Bob,2018-03-01,100', converters: %i[numeric date])
# #=> [["Bob", #<Date: 2018-03-01>, 100]]
# # With custom converters:
# CSV.parse('Bob,2018-03-01,100', converters: [->(v) { Time.parse(v) rescue v }])
# #=> [["Bob", 2018-03-01 00:00:00 +0200, "100"]]
# == CSV and Character Encodings (M17n or Multilingualization)
# This new CSV parser is m17n savvy. The parser works in the Encoding of the IO
# or String object being read from or written to. Your data is never transcoded
# (unless you ask Ruby to transcode it for you) and will literally be parsed in
# the Encoding it is in. Thus CSV will return Arrays or Rows of Strings in the
# Encoding of your data. This is accomplished by transcoding the parser itself
# into your Encoding.
# Some transcoding must take place, of course, to accomplish this multiencoding
# support. For example, <tt>:col_sep</tt>, <tt>:row_sep</tt>, and
# <tt>:quote_char</tt> must be transcoded to match your data. Hopefully this
# makes the entire process feel transparent, since CSV's defaults should just
# magically work for your data. However, you can set these values manually in
# the target Encoding to avoid the translation.
# It's also important to note that while all of CSV's core parser is now
# Encoding agnostic, some features are not. For example, the built-in
# converters will try to transcode data to UTF-8 before making conversions.
# Again, you can provide custom converters that are aware of your Encodings to
# avoid this translation. It's just too hard for me to support native
# conversions in all of Ruby's Encodings.
# Anyway, the practical side of this is simple: make sure IO and String objects
# passed into CSV have the proper Encoding set and everything should just work.
# CSV methods that allow you to open IO objects (CSV::foreach(), CSV::open(),
# CSV::read(), and CSV::readlines()) do allow you to specify the Encoding.
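# For example, a sketch assuming a Latin-1 encoded file at a hypothetical path:
#   CSV.foreach("path/to/latin1.csv", encoding: "ISO-8859-1") do |row|
#     row.first.encoding # => #<Encoding:ISO-8859-1>
#   end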
# One minor exception comes when generating CSV into a String with an Encoding
# that is not ASCII compatible. There's no existing data for CSV to use to
# prepare itself and thus you will probably need to manually specify the desired
# Encoding for most of those cases. It will try to guess using the fields in a
# row of output though, when using CSV::generate_line() or Array#to_csv().
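# A sketch of that guessing behavior (assuming UTF-16LE fields; the generated
# line should come back in the fields' Encoding):
#   line = CSV.generate_line(["foo".encode("UTF-16LE"), "bar".encode("UTF-16LE")])
#   line.encoding # => #<Encoding:UTF-16LE>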
# I try to point out any other Encoding issues in the documentation of methods
# and classes with this issue.
# This has been tested to the best of my ability with all non-"dummy" Encodings
# Ruby ships with. However, it is brave new code and may have some bugs.
# Please feel free to {report}[mailto:james@grayproductions.net] any issues you
# find with it.
# The error thrown when the parser encounters illegal CSV formatting.
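# A usage sketch (assuming data with an unclosed quote):
#   begin
#     CSV.parse("1,2\n3,\"4\n")
#   rescue CSV::MalformedCSVError => error
#     error.line_number # => the line on which parsing failed
#   end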
class MalformedCSVError < RuntimeError
  attr_reader :line_number
  alias_method :lineno, :line_number
  def initialize(message, line_number)
    @line_number = line_number
    super("#{message} in line #{line_number}.")
  end
end
# A FieldInfo Struct contains details about a field's position in the data
# source it was read from. CSV will pass this Struct to some blocks that make
# decisions based on field structure. See CSV.convert_fields() for an example,
# or the sketch below.
# <b><tt>index</tt></b>:: The zero-based index of the field in its row.
# <b><tt>line</tt></b>:: The line of the data source this row is from.
# <b><tt>header</tt></b>:: The header for the column, when available.
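# For example, a sketch of a two-argument converter that only touches the first
# column of each row:
#   CSV.parse("1,2\n3,4", converters: lambda { |field, info|
#     info.index.zero? ? Integer(field) : field
#   })
#   # => [[1, "2"], [3, "4"]]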
FieldInfo = Struct.new(:index, :line, :header)
# A Regexp used to find and convert some common Date formats.
DateMatcher = / \A(?: (\w+,?\s+)?\w+\s+\d{1,2},?\s+\d{2,4} |
                      \d{4}-\d{2}-\d{2} )\z /x
# A Regexp used to find and convert some common DateTime formats.
DateTimeMatcher =
  / \A(?: (\w+,?\s+)?\w+\s+\d{1,2}\s+\d{1,2}:\d{1,2}:\d{1,2},?\s+\d{2,4} |
          \d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2} |
          \d{4}-\d{2}-\d{2}
            (?:T\d{2}:\d{2}(?::\d{2}(?:\.\d+)?(?:[+-]\d{2}(?::\d{2})|Z)?)?)?
      )\z /x
# The encoding used by all converters.
ConverterEncoding = Encoding.find("UTF-8")
# This Hash holds the built-in converters of CSV that can be accessed by name.
# You can select Converters with CSV.convert() or through the +options+ Hash
# passed to CSV::new().
# <b><tt>:integer</tt></b>:: Converts any field Integer() accepts.
# <b><tt>:float</tt></b>:: Converts any field Float() accepts.
# <b><tt>:numeric</tt></b>:: A combination of <tt>:integer</tt>
#   and <tt>:float</tt>.
# <b><tt>:date</tt></b>:: Converts any field Date::parse() accepts.
# <b><tt>:date_time</tt></b>:: Converts any field DateTime::parse() accepts.
# <b><tt>:all</tt></b>:: All built-in converters. A combination of
# <tt>:date_time</tt> and <tt>:numeric</tt>.
# All built-in converters transcode field data to UTF-8 before attempting a
# conversion. If your data cannot be transcoded to UTF-8 the conversion will
# fail and the field will remain unchanged.
# This Hash is intentionally left unfrozen and users should feel free to add
# values to it that can be accessed by all CSV objects.
# To add a combo field, the value should be an Array of names. Combo fields
# can be nested with other combo fields.
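# For example, a sketch that registers a custom converter and a combo under
# made-up names:
#   CSV::Converters[:upcase]     = lambda { |f| f.upcase }
#   CSV::Converters[:upcase_all] = [:upcase, :all]
#   CSV.parse("a,1", converters: :upcase_all) # => [["A", 1]]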
Converters = {
  integer:   lambda { |f| Integer(f.encode(ConverterEncoding)) rescue f },
  float:     lambda { |f| Float(f.encode(ConverterEncoding)) rescue f },
  numeric:   [:integer, :float],
  date:      lambda { |f|
    begin
      e = f.encode(ConverterEncoding)
      e.match?(DateMatcher) ? Date.parse(e) : f
    rescue # encoding conversion or date parse errors
      f
    end
  },
  date_time: lambda { |f|
    begin
      e = f.encode(ConverterEncoding)
      e.match?(DateTimeMatcher) ? DateTime.parse(e) : f
    rescue # encoding conversion or date parse errors
      f
    end
  },
  all:       [:date_time, :numeric],
}
# This Hash holds the built-in header converters of CSV that can be accessed
# by name. You can select HeaderConverters with CSV.header_convert() or
# through the +options+ Hash passed to CSV::new().
# <b><tt>:downcase</tt></b>:: Calls downcase() on the header String.
# <b><tt>:symbol</tt></b>:: Leading/trailing spaces are dropped, string is
# downcased, remaining spaces are replaced with
# underscores, non-word characters are dropped,
# and finally to_sym() is called.
# All built-in header converters transcode header data to UTF-8 before
# attempting a conversion. If your data cannot be transcoded to UTF-8 the
# conversion will fail and the header will remain unchanged.
# This Hash is intentionally left unfrozen and users should feel free to add
# values to it that can be accessed by all CSV objects.
# To add a combo field, the value should be an Array of names. Combo fields
# can be nested with other combo fields.
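# For example, a sketch using the built-in :symbol header converter:
#   table = CSV.parse("First Name,Age\nBob,35", headers: true, header_converters: :symbol)
#   table.first.to_h # => {:first_name=>"Bob", :age=>"35"}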
HeaderConverters = {
  downcase: lambda { |h| h.encode(ConverterEncoding).downcase },
  symbol:   lambda { |h| h.encode(ConverterEncoding).downcase.gsub(/[^\s\w]+/, "").strip.gsub(/\s+/, "_").to_sym }
}
# The options used when no overrides are given by calling code. They are:
# <b><tt>:col_sep</tt></b>:: <tt>","</tt>
# <b><tt>:row_sep</tt></b>:: <tt>:auto</tt>
# <b><tt>:quote_char</tt></b>:: <tt>'"'</tt>
# <b><tt>:field_size_limit</tt></b>:: +nil+
# <b><tt>:converters</tt></b>:: +nil+
# <b><tt>:unconverted_fields</tt></b>:: +nil+
# <b><tt>:headers</tt></b>:: +false+
# <b><tt>:return_headers</tt></b>:: +false+
# <b><tt>:header_converters</tt></b>:: +nil+
# <b><tt>:skip_blanks</tt></b>:: +false+
# <b><tt>:force_quotes</tt></b>:: +false+
# <b><tt>:skip_lines</tt></b>:: +nil+
# <b><tt>:liberal_parsing</tt></b>:: +false+
# <b><tt>:quote_empty</tt></b>:: +true+
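# Any of these can be overridden in the +options+ passed to CSV's reading and
# writing methods. For example, a sketch for semicolon-separated data:
#   CSV.parse("1;2;3\n", col_sep: ";") # => [["1", "2", "3"]]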
# This method will return a CSV instance, just like CSV::new(), but the
# instance will be cached and returned for all future calls to this method for
# the same +data+ object (tested by Object#object_id()) with the same
# +options+.
# If a block is given, the instance is passed to the block and the return
# value becomes the return value of the block.
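# A sketch of the caching behavior (reusing the same String object and options):
#   out = String.new
#   CSV.instance(out).equal?(CSV.instance(out)) # => true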
def instance(data = $stdout, **options)
  # create a _signature_ for this method call, data object and options
  sig = [data.object_id] +
        options.values_at(*DEFAULT_OPTIONS.keys.sort_by { |sym| sym.to_s })
  # fetch or create the instance for this signature
  @@instances ||= Hash.new
  instance = (@@instances[sig] ||= new(data, **options))
  if block_given?
    yield instance # run block, if given, returning result
  else
    instance       # or return the instance
  end
end
# :call-seq:
#   filter( **options ) { |row| ... }
#   filter( input, **options ) { |row| ... }
#   filter( input, output, **options ) { |row| ... }
# This method is a convenience for building Unix-like filters for CSV data.
# Each row is yielded to the provided block which can alter it as needed.
# After the block returns, the row is appended to +output+ altered or not.
# The +input+ and +output+ arguments can be anything CSV::new() accepts
# (generally String or IO objects). If not given, they default to
# <tt>ARGF</tt> and <tt>$stdout</tt>.
# The +options+ parameter is also filtered down to CSV::new() after some
# clever key parsing. Any key beginning with <tt>:in_</tt> or
# <tt>:input_</tt> will have that leading identifier stripped and will only
# be used in the +options+ Hash for the +input+ object. Keys starting with
# <tt>:out_</tt> or <tt>:output_</tt> affect only +output+. All other keys
# are assigned to both objects.
# The <tt>:output_row_sep</tt> +option+ defaults to
# <tt>$INPUT_RECORD_SEPARATOR</tt> (<tt>$/</tt>).
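# For example, a sketch converting tab-separated input into comma-separated
# output (the block passes each row through unchanged):
#   output = String.new
#   CSV.filter("name\tvalue\nfoo\t0\n", output, in_col_sep: "\t") { |row| row }
#   output # => "name,value\nfoo,0\n"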
def filter(input=nil, output=nil, **options)
  # parse options for input, output, or both
  in_options, out_options = Hash.new, {row_sep: $INPUT_RECORD_SEPARATOR}
  options.each do |key, value|
    case key.to_s
    when /\Ain(?:put)?_(.+)\Z/
      in_options[$1.to_sym] = value
    when /\Aout(?:put)?_(.+)\Z/
      out_options[$1.to_sym] = value
    else
      in_options[key] = out_options[key] = value
    end
  end
  # build input and output wrappers
  input  = new(input  || ARGF,    **in_options)
  output = new(output || $stdout, **out_options)
  # read, yield to the block, then write each row
  input.each do |row|
    yield row
    output << row
  end
end
# This method is intended as the primary interface for reading CSV files. You
# pass a +path+ and any +options+ you wish to set for the read. Each row of
# the file will be passed to the provided +block+ in turn.
# The +options+ parameter can be anything CSV::new() understands. This method
# also understands an additional <tt>:encoding</tt> parameter that you can use
# to specify the Encoding of the data in the file to be read.