Mayur Patel: MongoDB M101N Final Exam Question 2

Monday, 4 May 2015

MongoDB M101N Final Exam Question 2

Question 2:

Please use the Enron dataset you imported for the previous problem. For this question you will use the aggregation framework to figure out pairs of people that tend to communicate a lot. To do this, you will need to unwind the To list for each message.

This problem is a little tricky because a recipient may appear more than once in the To list for a message. You will need to fix that in a stage of the aggregation before doing your grouping and counting of (sender, recipient) pairs.

Which pair of people have the greatest number of messages in the dataset?

susan.mara@enron.com to jeff.dasovich@enron.com
susan.mara@enron.com to richard.shapiro@enron.com
soblander@carrfut.com to soblander@carrfut.com
susan.mara@enron.com to james.steffes@enron.com
evelyn.metoyer@enron.com to kate.symes@enron.com
susan.mara@enron.com to alan.comnes@enron.com

Solution:

The question is a bit tricky, as the single recipient may appear more than once in a "To" array and we have to consider it as one message. For that we first remove duplicate recipients from each message and recreate "To" array with unique recipients

Ans is : susan.mara@enron.com to jeff.dasovich@enron.com having maximum 750 messages

see below query:

db.messages.aggregate([
{
   $unwind:'$headers.To'
},
{
   $group:
      {
       _id: {'id': '$_id', 'From': '$headers.From' ,'To': '$headers.To'},
        To :{$addToSet : '$headers.To'}
      }
},
{
   $unwind:'$To'
},
{
   $group :
      {
       _id: {'From': '$_id.From', 'To': '$_id.To'},
       'count': {$sum: 1}
      }
},
{
   $sort:{'count': -1}
},
{
   $limit : 5
}],

{
allowDiskUse: true
}).pretty()

2 comments:

Unknown18 September 2016 at 10:42
In your first group stage, why are you using a composite id of _id, "$headers.From" and "$headers.To". If you do that query till the first group stage, you will find that the the query is creating a new document with single element in the To array, something like this:

Query:
```
db.messages.aggregate([
{$unwind:"$headers.To"},
{$group:
{_id:
{_id:"$_id", to:"$headers.To", from:"$headers.From"},
To:{$addToSet:"$headers.To"}
}}
],
{allowDiskUse:true});
```
Response:
```
{
"_id" : {
"_id" : ObjectId("4f16fc97d1e2d32371003f3e"),
"to" : "bryan.hull@enron.com",
"from" : "eric.bass@enron.com"
},
"To" : [
"bryan.hull@enron.com"
]
}

/* 9 */
{
"_id" : {
"_id" : ObjectId("4f16fc97d1e2d32371003f3e"),
"to" : "david.baumbach@enron.com",
"from" : "eric.bass@enron.com"
},
"To" : [
"david.baumbach@enron.com"
]
}
```
as you can see the ID is the same, but we wan to again get the array To back so as to remove the duplicates. If you remove the "$headers.To" from the composite id in the group filter, you get this:

Query:
```
db.messages.aggregate([
{$unwind:"$headers.To"},
{$group:
{_id:
{_id:"$_id", from:"$headers.From"},
To:{$addToSet:"$headers.To"}
}}
],
{allowDiskUse:true});
```

Response:

```
{
"_id" : {
"_id" : ObjectId("4f16fe37d1e2d3237107dd50"),
"from" : "bob.shults@enron.com"
},
"To" : [
"john.sherriff@enron.com",
"louise.kitchen@enron.com",
"kevin.mcgowan@enron.com",
"marcello.romano@enron.com",
"david.gallagher@enron.com",
"kevin.presto@enron.com",
"john.arnold@enron.com",
"john.jeffrey@enron.com",
"paul.racicot@enron.com",
"thomas.martin@enron.com",
"john.lavorato@enron.com",
"hunter.shively@enron.com",
"phillip.allen@enron.com",
"jim.fallon@enron.com",
"scott.neal@enron.com",
"chris.mahoney@enron.com",
"paul.mead@enron.com",
"dana.davis@enron.com",
"gregor.baumerich@enron.com",
"george.mcclellan@enron.com",
"tim.belden@enron.com",
"rogers.herndon@enron.com",
"jonathan.mckay@enron.com",
"fletcher.sturm@enron.com",
"william.white@enron.com",
"richard.lewis@enron.com",
"thor.lien@enron.com"
]
}
```
Although you again unwind $To after this, so both the result will be the same, but this is to just point out the intention behind both the queries is different.
ReplyDelete
Replies
Unknown10 December 2017 at 10:12
Hello. This one works falser and without allowDiskUse : true

db.messages.aggregate([
{ $unwind: "$headers.To" } ,
{ $group: { _id: "$_id", "mes_to": { $addToSet: "$headers.To" }, "mes_from" : { $first: "$headers.From" } }},
{ $unwind: "$mes_to" } ,
{ $group: { _id: { from: "$mes_from", to : "$mes_to" }, count: { $sum: 1 } }},
{ $sort: { "count" : -1 } },
{ $limit: 1} ])
ReplyDelete
Replies

Mayur Patel

Pages

Monday, 4 May 2015

MongoDB M101N Final Exam Question 2

2 comments:

Followers

Total Pageviews

Labels

Blog Archive