I have downloaded a CSV file from Hotmail, but it has a lot of duplicates in it. These duplicates are complete copies and I don't know why my phone created them.
I want to get rid of the duplicates.
Write a Python script to remove the duplicates.
- Windows XP SP 3
- Python 2.7
- CSV file with 400 contacts
If you are happy to use the helpful more_itertools external library:
from more_itertools import unique_everseen

with open('1.csv', 'r') as f, open('2.csv', 'w') as out_file:
    out_file.writelines(unique_everseen(f))
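Note that more_itertools is not in the standard library, so it has to be installed first (e.g. pip install more_itertools). As a small, hedged variation: unique_everseen also takes an optional key function, so if you wanted to treat lines that differ only in letter case as duplicates (an assumption, not something the question asks for), a sketch would look like:

from more_itertools import unique_everseen

with open('1.csv', 'r') as f, open('2.csv', 'w') as out_file:
    # the key makes 'Foo' and 'foo' count as the same line (assumed requirement)
    out_file.writelines(unique_everseen(f, key=lambda line: line.lower()))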
A more efficient version of @IcyFlame’s solution
with open('1.csv', 'r') as in_file, open('2.csv', 'w') as out_file:
    seen = set()  # set for fast O(1) amortized lookup
    for line in in_file:
        if line in seen:
            continue  # skip duplicate
        seen.add(line)
        out_file.write(line)
To edit the same file in place, you could use this:
import fileinput

seen = set()  # set for fast O(1) amortized lookup
for line in fileinput.FileInput('1.csv', inplace=1):
    if line in seen:
        continue  # skip duplicate
    seen.add(line)
    print line,  # standard output is now redirected to the file
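Since inplace=1 rewrites the original file, it may be worth keeping a backup; fileinput supports this directly through its backup argument. A minimal sketch of the same loop with a backup:

import fileinput

seen = set()
# backup='.bak' keeps the original contents in 1.csv.bak in case anything goes wrong
for line in fileinput.FileInput('1.csv', inplace=1, backup='.bak'):
    if line in seen:
        continue  # skip duplicate
    seen.add(line)
    print line,  # standard output is redirected to the file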
You can achieve deduplication efficiently using pandas:
import pandas as pd

file_name = "my_file_with_dupes.csv"
file_name_output = "my_file_without_dupes.csv"

df = pd.read_csv(file_name, sep=",")

# Notes:
# - `subset=None` means that every column is used
#   to determine if two rows are different; to change that, specify
#   the columns as an array
# - `inplace=True` means that the data structure is changed and
#   the duplicate rows are gone
df.drop_duplicates(subset=None, inplace=True)

# Write the results to a different file
df.to_csv(file_name_output, index=False)
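If you only want rows deduplicated on one identifying column, subset takes a list of column names. A minimal sketch for the contacts file in the question, assuming a column literally named 'Email' (a hypothetical header; use whatever the export actually contains):

import pandas as pd

df = pd.read_csv("my_file_with_dupes.csv")
# keep the first occurrence of each email address ('Email' is an assumed column name)
df = df.drop_duplicates(subset=['Email'], keep='first')
df.to_csv("my_file_without_dupes.csv", index=False)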
You can use the following script:
1.csv is the file that contains the duplicates.
2.csv is the output file that will be devoid of the duplicates once this script is executed.
inFile = open('1.csv', 'r')
outFile = open('2.csv', 'w')
listLines = []

for line in inFile:
    if line in listLines:
        continue
    else:
        outFile.write(line)
        listLines.append(line)

outFile.close()
inFile.close()
Here, what I am doing is:
- opening a file in read mode. This is the file that has the duplicates.
- Then, in a loop that runs until the end of the file, we check whether the line has already been encountered.
- If it has been encountered, we don't write it to the output file.
- If not, we write it to the output file and add it to the list of records that have been encountered already (a csv-module variation on this idea is sketched just after this list).
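One caveat with comparing raw lines: two rows that differ only in quoting or trailing whitespace count as different. If that matters, the standard-library csv module can compare parsed rows instead; this is a sketch of that idea, not part of the original answer (note the binary file modes the csv module expects on Python 2.7):

import csv

with open('1.csv', 'rb') as in_file, open('2.csv', 'wb') as out_file:
    reader = csv.reader(in_file)
    writer = csv.writer(out_file)
    seen = set()
    for row in reader:
        key = tuple(row)  # lists are unhashable, so convert each row to a tuple
        if key in seen:
            continue  # skip duplicate row
        seen.add(key)
        writer.writerow(row)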
I know this is long settled, but I have had a closely related problem where I had to remove duplicates based on one column. The input CSV file was too large to open on my PC with MS Excel/LibreOffice Calc/Google Sheets: 147 MB with about 2.5 million records. Since I did not want to install a whole external library for such a simple thing, I wrote the Python script below to do the job in less than 5 minutes. I didn't focus on optimization, but I believe it can be optimized to run faster and more efficiently for even bigger files. The algorithm is similar to @IcyFlame's above, except that I am removing duplicates based on a column ('CCC') instead of the whole row/line.
import csv

with open('results.csv', 'r') as infile, open('unique_ccc.csv', 'a') as outfile:
    # this list will hold unique ccc numbers
    ccc_numbers = []
    # read the input file into a dictionary; there were some null bytes in the infile
    results = csv.DictReader(infile)
    writer = csv.writer(outfile)

    # write column headers to the output file
    writer.writerow(
        ['ID', 'CCC', 'MFLCode', 'DateCollected', 'DateTested', 'Result', 'Justification']
    )

    for result in results:
        ccc_number = result.get('CCC')
        # if the value already exists in the list, skip writing the whole row to the output file
        if ccc_number in ccc_numbers:
            continue
        writer.writerow([
            result.get('ID'),
            ccc_number,
            result.get('MFLCode'),
            result.get('datecollected'),
            result.get('DateTested'),
            result.get('Result'),
            result.get('Justification')
        ])
        # add the value to the list so it is skipped subsequently
        ccc_numbers.append(ccc_number)
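One easy speedup in the spirit of the set-based answers above: membership tests on a list are O(n), so with ~2.5 million records the `ccc_number in ccc_numbers` check dominates the run time. A hedged sketch of the same script with a set instead (same assumed file and column names as above):

import csv

with open('results.csv', 'r') as infile, open('unique_ccc.csv', 'w') as outfile:
    seen = set()  # O(1) amortized membership tests instead of O(n) list scans
    reader = csv.DictReader(infile)
    writer = csv.writer(outfile)
    writer.writerow(['ID', 'CCC', 'MFLCode', 'DateCollected', 'DateTested', 'Result', 'Justification'])
    for row in reader:
        ccc_number = row.get('CCC')
        if ccc_number in seen:
            continue  # duplicate CCC, skip the whole row
        seen.add(ccc_number)
        writer.writerow([row.get('ID'), ccc_number, row.get('MFLCode'),
                         row.get('datecollected'), row.get('DateTested'),
                         row.get('Result'), row.get('Justification')])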
A more efficient version of @jamylak's solution (with one fewer instruction):
with open('1.csv', 'r') as in_file, open('2.csv', 'w') as out_file:
    seen = set()  # set for fast O(1) amortized lookup
    for line in in_file:
        if line not in seen:
            seen.add(line)
            out_file.write(line)
To edit the same file in place, you could use this:
import fileinput

seen = set()  # set for fast O(1) amortized lookup
for line in fileinput.FileInput('1.csv', inplace=1):
    if line not in seen:
        seen.add(line)
        print line,  # standard output is now redirected to the file
You can do this with the pandas library in a Jupyter notebook or a relevant IDE. Here I am importing pandas into a Jupyter notebook and reading the CSV file.
Then sort the values according to the parameters on which duplicates are present; since I have defined two attributes, it will sort first by time, then by latitude.
Then remove the duplicates present in the time column, or whichever column is relevant for you.
Then I store the deduplicated, sorted file as gps_sorted.
import pandas as pd

stock = pd.read_csv("C:/Users/Donuts/GPS Trajectory/go_track_trackspoints.csv")
stock2 = stock.sort_values(["time", "latitude"], ascending=True)
# drop_duplicates returns a new DataFrame, so the result has to be assigned back
stock2 = stock2.drop_duplicates(subset=['time'])
stock2.to_csv("C:/Users/Donuts/gps_sorted.csv")
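As a quick, hedged sanity check on the same DataFrame (same paths and column names as above), duplicated() counts how many rows would be dropped before you actually drop them:

import pandas as pd

stock = pd.read_csv("C:/Users/Donuts/GPS Trajectory/go_track_trackspoints.csv")
# number of rows whose time value already appeared in an earlier row
print(stock.duplicated(subset=['time']).sum())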
Hope this helps