How to define an input for a Python multiprocessing function to take all the files in a directory?


This question may sound basic; I do not know much about multiprocessing and am just learning.

I have Python code that processes a bunch of files in a directory, and I would like to run it through a pool like this:

    with Pool(processes=cores) as pp:
        pp.map(function, list)

Here's my code:

    from collections import defaultdict
    from glob import glob
    from os import getpid
    from time import time

    from publicsuffix import PublicSuffixList

    path = '/data/personal'
    print("Running with PID: %d" % getpid())
    psl = PublicSuffixList()
    d = defaultdict(set)
    start = time()

    #
    files_list = glob(path)

    for filename in files_list:
        print(filename)
        f = open(filename, 'r')
        for n, line in enumerate(f):
            line = line[:-1]
            ip, reversed_domain_1 = line.split('|')
            reversed_domain_2 = reversed_domain_1.split('.')
            reversed_domain_3 = list(reversed(reversed_domain_2))
            domain = '.'.join(reversed_domain_3)
            domain = psl.get_public_suffix(domain)
            d[ip].add(domain)
    ###
    for ip, domains in d.items():
        for domain in domains:
            print(ip, domain)

How can I run this in a multiprocessing pool?

You can process each file in a different process like this:

    from collections import defaultdict
    from functools import partial
    from glob import glob
    from multiprocessing import Pool, cpu_count
    from os import getpid
    from time import time

    from publicsuffix import PublicSuffixList

    path = '/data/personal'
    print("Running with PID: %d" % getpid())

    def process_file(psl, filename):
        """Parse one file and return all of its (ip, domain) pairs."""
        print(filename)
        pairs = []
        with open(filename, 'r') as f:
            for line in f:
                line = line.rstrip('\n')
                ip, reversed_domain_1 = line.split('|')
                reversed_domain_2 = reversed_domain_1.split('.')
                reversed_domain_3 = list(reversed(reversed_domain_2))
                domain = '.'.join(reversed_domain_3)
                domain = psl.get_public_suffix(domain)
                pairs.append((ip, domain))
        return pairs

    if __name__ == "__main__":
        psl = PublicSuffixList()
        d = defaultdict(set)
        start = time()

        # glob(path) would match only the directory itself, so match its entries.
        files_list = glob(path + '/*')
        pp = Pool(processes=cpu_count())
        func = partial(process_file, psl)
        # One result per file, delivered as soon as a child finishes it.
        for pairs in pp.imap_unordered(func, files_list):
            for ip, domain in pairs:
                d[ip].add(domain)
        pp.close()
        pp.join()

        for ip, domains in d.items():
            for domain in domains:
                print(ip, domain)

Note that the defaultdict is populated in the parent process, because the same defaultdict cannot actually be shared between multiple processes without using a multiprocessing.Manager. You could do that here if you wanted, but I don't think it is necessary. Instead, as soon as any child's result is available, we add it to the defaultdict in the parent. Using imap_unordered instead of map lets us receive results on demand, rather than having to wait for all of them to be ready. The only other noteworthy thing is the use of partial, which lets imap_unordered pass psl to every child process in addition to one item from files_list.
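To see partial and imap_unordered working together without the publicsuffix dependency, here is a minimal sketch. The names `parse_line` and `collect`, the separator argument, and the sample lines are made up for illustration; only the `ip|reversed.domain` line format comes from the question.

```python
from collections import defaultdict
from functools import partial
from multiprocessing import Pool


def parse_line(sep, line):
    """Split 'ip<sep>reversed.domain' and put the domain labels back in order.

    `sep` is bound once with partial; `line` comes from the iterable.
    """
    ip, reversed_domain = line.rstrip("\n").split(sep)
    domain = ".".join(reversed(reversed_domain.split(".")))
    return ip, domain


def collect(pairs):
    """Fold (ip, domain) pairs into a dict of sets; this runs in the parent."""
    d = defaultdict(set)
    for ip, domain in pairs:
        d[ip].add(domain)
    return d


if __name__ == "__main__":
    # Hypothetical sample input in the question's line format.
    lines = [
        "10.0.0.1|com.example.www",
        "10.0.0.2|org.example",
        "10.0.0.1|com.example.mail",
    ]
    func = partial(parse_line, "|")  # func(line) == parse_line("|", line)
    with Pool(processes=2) as pool:
        # Pairs are yielded as workers finish, not necessarily in input order.
        d = collect(pool.imap_unordered(func, lines))
    for ip, domains in sorted(d.items()):
        print(ip, sorted(domains))
```

Because the merging happens in `collect`, in the parent, no shared dictionary or Manager is involved; the children only return plain tuples.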

One important note here: using multiprocessing for this kind of operation may not actually improve performance. A lot of the work you are doing is reading from disk, which cannot be sped up by using multiple processes; your hard drive can only do one read at a time. Receiving read requests for different files from a bunch of processes at once can actually be slower than doing the reads sequentially, because the drive potentially has to keep switching to different areas of the physical disk to read a new line from each file. Now, it is possible that the CPU-bound work you do with each line is expensive enough to dominate the I/O time, in which case you will see a speed boost.
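One way to find out which case applies is to time both versions on a sample of the real workload. A rough sketch, where `cpu_heavy` is a hypothetical stand-in for the per-line work and the job sizes and pool size are arbitrary:

```python
import time
from multiprocessing import Pool


def cpu_heavy(n):
    """Hypothetical stand-in for expensive per-line work."""
    return sum(i * i for i in range(n))


def timed(run):
    """Call a zero-argument callable and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = run()
    return result, time.perf_counter() - start


if __name__ == "__main__":
    jobs = [200_000] * 16  # arbitrary sizes, for illustration only

    serial, t_serial = timed(lambda: [cpu_heavy(n) for n in jobs])
    with Pool(processes=4) as pool:
        parallel, t_parallel = timed(lambda: pool.map(cpu_heavy, jobs))

    assert serial == parallel  # both approaches must agree on the results
    print(f"serial:   {t_serial:.3f}s")
    print(f"parallel: {t_parallel:.3f}s")
```

Swapping `cpu_heavy` for the real per-file function, and the job list for a handful of real files, would show whether the pool pays for its overhead on your disk and data.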

