Given a large file that does not fit in memory (say 10 GB), find the top 100,000 most frequent phrases. The file has 50 phrases per line, separated by a pipe (|). Assume that the phrases do not contain a pipe.
An example line may look like:
Foobar Candy | Olympics 2012 | PGA | CNET | Microsoft Bing ….
The above line shows 5 phrases in the visible region.
Solution:
{actual method}
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public List<Map.Entry<String, Integer>> findTopHundredThousandFrequentPhrases(String folderPath, String fileName) {
    Map<String, Integer> occurrences = new HashMap<>();
    Path path = Paths.get(folderPath, fileName);
    try (BufferedReader br = Files.newBufferedReader(path, StandardCharsets.UTF_8)) {
        String line;
        while ((line = br.readLine()) != null) {
            // Phrases are pipe-separated; trim the whitespace around each pipe.
            for (String phrase : line.split("\\|")) {
                occurrences.merge(phrase.trim(), 1, Integer::sum);
            }
        }
    } catch (IOException ioe) {
        ioe.printStackTrace();
        return Collections.emptyList();
    }
    // Sort entries by descending count and keep the first 100,000.
    List<Map.Entry<String, Integer>> entries = new ArrayList<>(occurrences.entrySet());
    entries.sort(Map.Entry.<String, Integer>comparingByValue().reversed());
    return entries.stream()
            .limit(100_000)
            .collect(Collectors.toList());
}
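Note that the occurrences map above still grows with the number of distinct phrases, so for 10 GB of input it may itself not fit in memory (the point raised in the comments below). A common mitigation, sketched here under assumptions not in the original post (the class name, NUM_PARTITIONS, and the spillDir parameter are all hypothetical), is to hash-partition phrases into spill files on disk, count one partition at a time, and keep a bounded min-heap of the global top 100,000:

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

public class PartitionedPhraseCounter {

    private static final int NUM_PARTITIONS = 256; // hypothetical; tune to available memory
    private static final int TOP_K = 100_000;

    public static List<Map.Entry<String, Integer>> topPhrases(Path input, Path spillDir) throws IOException {
        Files.createDirectories(spillDir);

        // Pass 1: scatter phrases into partition files; hashing guarantees that
        // every occurrence of a given phrase lands in the same partition file.
        List<BufferedWriter> writers = new ArrayList<>();
        for (int i = 0; i < NUM_PARTITIONS; i++) {
            writers.add(Files.newBufferedWriter(spillDir.resolve("part-" + i), StandardCharsets.UTF_8));
        }
        try (BufferedReader br = Files.newBufferedReader(input, StandardCharsets.UTF_8)) {
            String line;
            while ((line = br.readLine()) != null) {
                for (String phrase : line.split("\\|")) {
                    String p = phrase.trim();
                    BufferedWriter w = writers.get(Math.floorMod(p.hashCode(), NUM_PARTITIONS));
                    w.write(p);
                    w.newLine();
                }
            }
        } finally {
            for (BufferedWriter w : writers) w.close();
        }

        // Pass 2: count one partition at a time; a min-heap keeps only the
        // TOP_K most frequent phrases seen across all partitions.
        PriorityQueue<Map.Entry<String, Integer>> heap =
                new PriorityQueue<>(Map.Entry.comparingByValue());
        for (int i = 0; i < NUM_PARTITIONS; i++) {
            Map<String, Integer> counts = new HashMap<>();
            try (BufferedReader br = Files.newBufferedReader(spillDir.resolve("part-" + i), StandardCharsets.UTF_8)) {
                String phrase;
                while ((phrase = br.readLine()) != null) {
                    counts.merge(phrase, 1, Integer::sum);
                }
            }
            for (Map.Entry<String, Integer> e : counts.entrySet()) {
                // Copy the entry: HashMap entries are views into a map we discard.
                heap.offer(new AbstractMap.SimpleEntry<>(e.getKey(), e.getValue()));
                if (heap.size() > TOP_K) {
                    heap.poll(); // evict the current minimum
                }
            }
        }

        List<Map.Entry<String, Integer>> top = new ArrayList<>(heap);
        top.sort(Map.Entry.<String, Integer>comparingByValue().reversed());
        return top;
    }
}

Because every occurrence of a phrase hashes to the same partition, the per-partition counts are exact, each in-memory map holds only about 1/256 of the distinct phrases, and the heap never holds more than 100,000 entries at once.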
{support class}
public class ComplementaryPair {

    public final int addendOne;
    public final int addendTwo;

    public ComplementaryPair(int addendOne, int addendTwo) {
        this.addendOne = addendOne;
        this.addendTwo = addendTwo;
    }

    public int sum() {
        return addendOne + addendTwo;
    }

    @Override
    public String toString() {
        return addendOne + " + " + addendTwo;
    }

    // Pairs are equal regardless of addend order: (a, b) equals (b, a).
    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (o == null || getClass() != o.getClass()) return false;
        ComplementaryPair that = (ComplementaryPair) o;
        return (addendOne == that.addendOne && addendTwo == that.addendTwo)
                || (addendOne == that.addendTwo && addendTwo == that.addendOne);
    }

    // Order-insensitive hash, consistent with equals: equal pairs share a sum.
    @Override
    public int hashCode() {
        return 31 * (addendOne + addendTwo);
    }
}
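For illustration (the driver below is hypothetical, not from the original post), the order-insensitive equals/hashCode lets a HashSet deduplicate mirrored pairs:

import java.util.HashSet;
import java.util.Set;

public class ComplementaryPairDemo {
    public static void main(String[] args) {
        Set<ComplementaryPair> pairs = new HashSet<>();
        pairs.add(new ComplementaryPair(3, 7));
        pairs.add(new ComplementaryPair(7, 3)); // mirrored duplicate, dropped by the set
        System.out.println(pairs.size());                  // 1
        System.out.println(pairs.iterator().next().sum()); // 10
    }
}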
I may be wrong, but hopefully I got this right. Feel free to comment or let me know your thoughts.
Hi Paul,
But what about holding the occurrences map in memory when the input is a bulk of 10 GB of data?