How to define the input for a Python multiprocessing function so that it takes all the files in a directory?
This question may sound basic, because I do not know much about multiprocessing; I am just learning.
I have Python code that processes a bunch of files in a directory, and I would like to parallelize it over those files with something like:

```python
with Pool(processes=cores) as pp:
    pp.map(function, files_list)
```
Here's my code:
```python
from collections import defaultdict
from glob import glob
from os import getpid
from time import time
from publicsuffix import PublicSuffixList

path = '/data/personal/*'  # glob pattern matching the files in the directory
print("running with PID: %d" % getpid())

psl = PublicSuffixList()
d = defaultdict(set)
start = time()

files_list = glob(path)
for filename in files_list:
    print(filename)
    with open(filename, 'r') as f:
        for n, line in enumerate(f):
            line = line[:-1]  # strip the trailing newline
            ip, reversed_domain_1 = line.split('|')
            reversed_domain_2 = reversed_domain_1.split('.')
            reversed_domain_3 = list(reversed(reversed_domain_2))
            domain = '.'.join(reversed_domain_3)
            domain = psl.get_public_suffix(domain)
            d[ip].add(domain)

for ip, domains in d.items():
    print(ip, domains)
```
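For clarity, here is what the per-line transformation does to a single sample record. The input line is made up for illustration, and the public-suffix lookup is omitted since it only operates on the already re-ordered domain:

```python
# Hypothetical input record: IP and a domain stored in reversed label order.
line = "1.2.3.4|com.example.www"
ip, reversed_domain = line.split('|')
parts = reversed_domain.split('.')   # ['com', 'example', 'www']
domain = '.'.join(reversed(parts))   # restore normal label order
print(ip, domain)                    # 1.2.3.4 www.example.com
```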
How can I call it in a multiprocessing pool?
You can process each file in a different process like this:
```python
from collections import defaultdict
from functools import partial
from glob import glob
from multiprocessing import Pool
from os import getpid
from time import time
from publicsuffix import PublicSuffixList

path = '/data/personal/*'  # glob pattern matching the files in the directory

def process_file(psl, filename):
    print("parsing %s in process with PID: %d" % (filename, getpid()))
    results = []
    with open(filename, 'r') as f:
        for n, line in enumerate(f):
            line = line[:-1]  # strip the trailing newline
            ip, reversed_domain_1 = line.split('|')
            reversed_domain_2 = reversed_domain_1.split('.')
            reversed_domain_3 = list(reversed(reversed_domain_2))
            domain = '.'.join(reversed_domain_3)
            domain = psl.get_public_suffix(domain)
            results.append((ip, domain))
    return results

if __name__ == "__main__":
    psl = PublicSuffixList()
    d = defaultdict(set)
    start = time()
    files_list = glob(path)

    pp = Pool()  # defaults to one worker per CPU core
    func = partial(process_file, psl)  # bind psl so workers get (psl, filename)
    for file_results in pp.imap_unordered(func, files_list):
        # Merge each file's (ip, domain) pairs as soon as that file is done.
        for ip, domain in file_results:
            d[ip].add(domain)
    pp.close()
    pp.join()

    for ip, domains in d.items():
        print(ip, domains)
```
Note that the defaultdict is populated in the parent process: the child processes cannot actually share a single defaultdict (short of using a multiprocessing.Manager, which you could do here, but it isn't needed). Instead, as soon as any child's result becomes available, we add it to the defaultdict in the parent. Using imap_unordered instead of map enables us to receive results on demand, rather than waiting for all of them to be ready. The only other notable thing is the use of partial, which lets us pass psl, in addition to one item from files_list, to every process_file call dispatched via imap_unordered.
One important note: using multiprocessing for this kind of operation may not actually improve performance. A large part of the work here is reading from disk, and that cannot be sped up by adding processes; your hard drive can only service one read at a time. Interleaving read requests for a bunch of different files from a bunch of different processes can even slow things down relative to reading them sequentially, because the disk potentially has to keep seeking between different physical areas to fetch the next line of each file. If, however, the CPU-bound work you do on each line is expensive enough to dominate the I/O time, then you will see a speedup; that is the situation in which this approach pays off.