Introduction
CSV (Comma-Separated Values) files are a widely used format for storing tabular data. They provide a simple and human-readable way to represent data, making them a popular choice for data interchange between different applications and systems.
Python, with its rich ecosystem of libraries, offers powerful tools for working with CSV files, enabling developers to easily read, write, and manipulate data. In this article, we will explore the basics to advanced of working with CSV files in Python.
Introduction to CSV Files
A CSV file consists of rows, each representing a record, and columns, which contain values for specific attributes. Columns are separated by a delimiter, usually a comma, but other delimiters like tabs or semicolons can also be used. Each row ends with a line break character.
Example CSV file (Get the file from here: ‘employee.csv‘):
Name,Age,Contact,Country
Alice,28,512-123-8182,USA
Bob,35,512-123-8183,UK
Charlie,22,512-123-8184,Canada
Reading CSV Files
Reading a CSV file using `csv.reader`
Python’s built-in `csv` module provides a straightforward way to read and write CSV files. To read a CSV file, you can use the `csv.reader` class:
import csv
with open('employee.csv', 'r') as file:
csv_reader = csv.reader(file)
for row in csv_reader:
print(row)
Output
[‘Name’, ‘Age’, ‘Contact’, ‘Country’]
[‘Alice’, ’28’, ‘512-123-8182’, ‘USA’]
[‘Bob’, ’35’, ‘512-123-8183’, ‘UK’]
[‘Charlie’, ’22’, ‘512-123-8184’, ‘Canada’]
Reading CSV into Dictionaries
While the basic `csv.reader` returns rows as lists, you can also read CSV data into dictionaries for more structured access. The `csv.DictReader` class offers reading CSV file data into a Python Dictionary.
The primary benefit of opting for the second approach is that it provides you with an organized set of data for your tasks.
import csv
with open('employee.csv', 'r') as file:
csv_dict_reader = csv.DictReader(file)
for row in csv_dict_reader:
print(row['Name'],row['Age'],row['Contact'],row['Country'])
Output
Alice 28 512-123-8182 USA
Bob 35 512-123-8183 UK
Charlie 22 512-123-8184 Canada
Writing CSV Files
Writing to a CSV file using `csv.writer`
To write data to a CSV file, you can use the `csv.writer` class. Take a quick look at the information in the ’employee.csv’ file shown above. We’re going to include an additional pair of data rows immediately adjacent to the current ones.
In the following code snippet, you’ll observe that the CSV file is opened in the append (‘a’) mode. With this approach, fresh data will be inserted alongside the current dataset.
If you intend to replace the existing data with new information, opt for the write (‘w’) mode instead of the append (‘a’) mode.
import csv
headers = ['Name','Age','Contact','Country']
rows = [['Alex','24',"512-123-8182",'Italy'],
['Peter','23',"650-231-8188",'Spain']
]
with open('employee.csv','a') as file:
csv_writer = csv.writer(file)
# csv_writer.writerow(headers)
csv_writer.writerows(rows)
Output
The current arrangement of data within the ’employee.csv’ file is now as follows:
Name,Age,Contact,Country
Alice,28,512-123-8182,USA
Bob,35,512-123-8183,UK
Charlie,22,512-123-8184,Canada
Alex,24,512-123-8182,Italy
Peter,23,650-231-8188,Spain
The `.writerow()` method is employed to insert a single data row, while `.writerows()` is utilized to add multiple rows that are organized within a Python list, dictionary, tuple, and so on.
As we’ve utilized the append mode on the non-empty ’employee.csv’ file, there’s no necessity to include the header rows more than once. This is the reason we’ve added a hashtag(‘#’) to that line (the yellow one).
If you’re contemplating generating a fresh file or overwriting the data, you should include the header rows. In such a scenario, eliminate the hashtag (‘#’) from that location (the yellow line).
Writing to a CSV file using `csv.DictWriter`
An alternative approach to writing data in a CSV file involves utilizing the `csv.DictWriter` class. This method offers better clarity and control over the order of columns and the data being written.
In this instance, we will be demonstrating how to input a data dictionary into a CSV file, complete with headers. For this scenario, we will be storing the information by employing the write (‘w’) method rather than the append (‘a’) method.
import csv
headers = ['Name','Age','Contact','Country']
rows = [{'Name':'John','Age':'26','Contact':"512-123-6841",'Country':'USA'},
{'Name':'Harry','Age':'27','Contact':"650-231-2078",'Country':'UK'}
]
with open('employee.csv','w') as file:
csv_writer = csv.DictWriter(file, headers)
csv_writer.writeheader()
csv_writer.writerows(rows)
Output
The contents of the ’employee.csv’ file now stand as follows:
Name,Age,Contact,Country
John,26,512-123-6841,USA
Harry,27,650-231-2078,UK
Customizing CSV Output
You can customize the CSV output by specifying delimiters, quote characters, and quoting styles while writing CSV data:
import csv
data = [
['Name', 'Age', 'Country'],
['Alice', 28, 'USA'],
['Bob', 35, 'Canada'],
['Charlie', 22, 'UK']
]
with open('output.csv', 'w', newline='') as file:
csv_writer = csv.writer(file, delimiter=';',
quotechar='"', quoting=csv.QUOTE_MINIMAL)
csv_writer.writerows(data)
Output
The data within the ‘output.csv’ file is currently presented in the following manner:
Name;Age;Country
Alice;28;USA
Bob;35;Canada
Charlie;22;UK
Manipulating CSV Data
Filtering and Transforming Data
Python’s list comprehensions make it easy to filter and transform CSV data. In this scenario, we will display all the rows where the age exceeds 25.
import csv
with open('employee.csv', 'r') as file:
csv_dict_reader = csv.DictReader(file)
filtered_data =
[row for row in csv_dict_reader if int(row['Age']) > 25]
for row in filtered_data:
print(row)
Output
{‘Name’: ‘John’, ‘Age’: ’26’, ‘Contact’: ‘512-123-6841’, ‘Country’: ‘USA’}
{‘Name’: ‘Harry’, ‘Age’: ’27’, ‘Contact’: ‘650-231-2078’, ‘Country’: ‘UK’}
Aggregating and Summarizing Data
You can use the `collections` module to perform aggregation and summary tasks on CSV data:
import csv
from collections import defaultdict
age_count = defaultdict(int)
with open('employee.csv', 'r') as file:
csv_dict_reader = csv.DictReader(file)
for row in csv_dict_reader:
age_count[int(row['Age'])] += 1
print(age_count)
Output
defaultdict(<class ‘int’>, {26: 1, 27: 1})
Joining and Merging CSV Files
Method 1: Merging Two CSV Files Based on a Key Column
Suppose you have two CSV files: `file1.csv` and `file2.csv`, and you want to merge them based on a common key column, such as `id`. Here’s how you can do it:
import csv
data1 = []
with open('file1.csv', 'r') as file1:
csv_reader = csv.DictReader(file1)
for row in csv_reader:
data1.append(row)
data2 = []
with open('file2.csv', 'r') as file2:
csv_reader = csv.DictReader(file2)
for row in csv_reader:
data2.append(row)
merged_data = []
for row1 in data1:
for row2 in data2:
if row1['id'] == row2['id']:
merged_row = {**row1, **row2}
merged_data.append(merged_row)
# Union of keys from both dictionaries
fieldnames = data1[0].keys() | data2[0].keys()
with open('merged.csv', 'w', newline='') as merged_file:
csv_writer = csv.DictWriter(merged_file, fieldnames=fieldnames)
csv_writer.writeheader()
csv_writer.writerows(merged_data)
Method 2: Joining Two CSV Files Based on a Key Column
Joining CSV files is a similar process to merging, but it typically involves adding selected columns from one file to another based on a common key. Here’s how you can perform a join using the `csv` module:
import csv
joined_data = []
for row1 in data1:
for row2 in data2:
if row1['id'] == row2['id']:
joined_row =
{'id': row1['id'], 'column1': row1['column1'],
'column2': row2['column2']}
joined_data.append(joined_row)
# Selected columns for the join
fieldnames = ['id', 'column1', 'column2']
with open('joined.csv', 'w', newline='') as joined_file:
csv_writer = csv.DictWriter(joined_file, fieldnames=fieldnames)
csv_writer.writeheader()
csv_writer.writerows(joined_data)
Method 3: Merging the Entire Files
It’s also possible to combine two CSV files without indicating specific key columns. In such a situation, both files need to share identical key columns.
import csv
combined_data = []
for filename in ['file1.csv', 'file2.csv']:
with open(filename, 'r') as file:
csv_dict_reader = csv.DictReader(file)
combined_data.extend(list(csv_dict_reader))
with open('combined.csv', 'w', newline='') as file:
fieldnames = combined_data[0].keys()
csv_dict_writer = csv.DictWriter(file, fieldnames=fieldnames)
csv_dict_writer.writeheader()
csv_dict_writer.writerows(combined_data)
Reading a CSV File using UTF-16 Encoding Technique
UTF-8 vs UTF-16
UTF-8 and UTF-16 are both character encodings that are used to represent and store textual data in computers. They are part of the Unicode standard, which aims to provide a consistent way of encoding characters from all the world’s writing systems. However, they differ in terms of how they encode and store characters. For Example:
- UTF-8 is a variable-width encoding, meaning that it uses 1 to 4 bytes to represent characters. ASCII characters (English letters, digits, basic symbols) are represented using a single byte (8 bits), making it storage-efficient for predominantly ASCII text. Non-ASCII characters use more bytes, depending on the character.
- UTF-16 uses 2 bytes (16 bits) for most characters, but characters outside the Basic Multilingual Plane (BMP) use 4 bytes (2 pairs of 2 bytes). This makes UTF-16 storage-efficient for a broader range of languages, as it can represent a wide variety of characters with a fixed 2-byte encoding.
UTF-16 is often used in Windows environments and for applications that require support for a wide range of characters, especially those outside the BMP.
So, If you need to support a wide range of characters or are working in a Windows environment, UTF-16 might be more suitable.
In this example, we will use a different CSV file (Get the file from here: ‘python-trends.csv‘) and use the `open()` function to open the CSV file, specifying the encoding parameter as ‘utf-16’.
import csv
# Opening the CSV file in read mode(by default)
with open('python-trends.csv', encoding='utf-16') as f:
csv_reader = csv.reader(f)
# column names
headers = next(reader)
# printing the column names
print(headers, 'n')
# printing the data from the reader object
for row in csv_reader:
print(row)
Reading a CSV File with Optional Arguments
Numerous columns or cells within a CSV file may hold data containing different special characters. To illustrate, let’s examine this with an example.
Remember that we initially interacted with a CSV file called ’employee.csv’. I introduced a small modification there to highlight some intriguing insights. Now, let’s take a look at the information stored within it.
Name Age Contact Joining_Date
“Alice” 28 512-123-8182 21-JUN, 2015
Bob 35 512-123-8183 17-SEP, 2011
Charlie’s 22 512-123-8184 12-JAN, 2020
In this file, the delimiter used is a space (‘ ‘), indicating that two values are separated by this space character.
- Orange Mark: `“Alice”` represents a name (or value) enclosed within double quotation marks.
- Green Mark: The word `Charlie’s` contains an apostrophe (‘), as shown.
- Yellow Mark: Three instances involve dates in which the year is distinctively separated by a comma from the day and month.
To get the result properly there are several optional arguments we can pass. Now we will tell the program to follow the simple steps:
- Utilize the space (‘ ‘) character as the delimiter by including the `delimiter=’ ‘` parameter. The default delimiter is a comma. If necessary, you can also indicate a semicolon (‘;’), tab (‘t’), or pipe (‘|’) as an alternative.
- Exclude the comma (‘,’) found within the date segment by using the `escapechar=’,’` parameter.
import csv
with open('employee.csv') as f:
csv_reader = csv.reader(f, delimiter=' ', escapechar=',')
header = next(csv_reader)
print(header,'n')
# printing all the rows of values
for row in csv_reader:
print(row)
Output
[‘Name’, ‘Age’, ‘Contact’, ‘Joining_Date’]
[‘Alice’, ’28’, ‘512-123-8182′, ’21-JUN 2015’]
[‘Bob’, ’35’, ‘512-123-8183′, ’17-SEP 2011′]
[“Charlie’s”, ’22’, ‘512-123-8184′, ’12-JAN 2020’]
Exercises
Here are some exercises to practice working with CSV files in Python:
1. Reading CSV Files:
- Read a CSV file named “data.csv” and print each row.
- Read a CSV file named “data.csv” with headers and print only the values in the “Name” column.
- Read a CSV file named “data.csv” and calculate the total number of rows in the file.
2. Writing CSV Files:
- Create a list of dictionaries containing information about books (title, author, year) and write it to a CSV file named “books.csv”.
- Write a program that generates a CSV file with a multiplication table for numbers 1 to 10.
- Create a CSV file named “grades.csv” and write student names and their corresponding grades to the file.
3. Manipulating CSV Data:
- Read a CSV file named “sales.csv” containing sales data (product, quantity, price) and calculate the total revenue.
- Read a CSV file named “students.csv” with student data (name, age, grade) and filter out students older than 18.
- Read a CSV file named “temperature.csv” with temperature data in Fahrenheit. Convert temperatures to Celsius and write them to a new CSV file.
4. Joining and Merging CSV Files:
- Merge two CSV files, “orders.csv” and “customers.csv,” based on the common column “customer_id,” and write the merged data to a new file.
- Join two CSV files, “products.csv” and “sales.csv,” based on the common column “product_id,” and write the joined data to a new file.
- Concatenate three CSV files, “file1.csv,” “file2.csv,” and “file3.csv,” vertically to create a single combined file.
5. Advanced Exercises:
- Explore a real-world dataset in CSV format (e.g., from Kaggle), and perform data cleaning, aggregation, and visualization using Python.
Remember to break down the exercises into smaller steps, and feel free to modify the scenarios to match your interests and learning goals.