Monday 4 May 2015

MongoDB M101N Final Exam Question 2

Question 2:

Please use the Enron dataset you imported for the previous problem. For this question you will use the aggregation framework to figure out pairs of people that tend to communicate a lot. To do this, you will need to unwind the To list for each message.

This problem is a little tricky because a recipient may appear more than once in the To list for a message. You will need to fix that in a stage of the aggregation before doing your grouping and counting of (sender, recipient) pairs.


Which pair of people have the greatest number of messages in the dataset?


  • susan.mara@enron.com to jeff.dasovich@enron.com
  • susan.mara@enron.com to richard.shapiro@enron.com 
  • soblander@carrfut.com to soblander@carrfut.com 
  • susan.mara@enron.com to james.steffes@enron.com 
  • evelyn.metoyer@enron.com to kate.symes@enron.com 
  • susan.mara@enron.com to alan.comnes@enron.com

Solution:

The question is a bit tricky, as the single recipient may appear more than once in a "To" array and we have to consider it as one message. For that we first remove duplicate recipients from each message and recreate "To" array with unique recipients

Ans is : susan.mara@enron.com to jeff.dasovich@enron.com having maximum 750 messages

see below query:

db.messages.aggregate([
 {
   $unwind:'$headers.To'
 }, 
 {
   $group: 
      {
       _id: {'id': '$_id', 'From': '$headers.From' ,'To': '$headers.To'}, 
        To :{$addToSet : '$headers.To'}
      }
 }, 
 {
   $unwind:'$To'
 },
 {
   $group : 
      {
       _id: {'From': '$_id.From', 'To': '$_id.To'}, 
       'count': {$sum: 1}
      }
 }, 
 {
   $sort:{'count': -1}
 }, 
 {
   $limit : 5
 }], 

 {
  allowDiskUse: true 
 }).pretty()

2 comments:

  1. In your first group stage, why are you using a composite id of _id, "$headers.From" and "$headers.To". If you do that query till the first group stage, you will find that the the query is creating a new document with single element in the To array, something like this:

    Query:
    ```
    db.messages.aggregate([
    {$unwind:"$headers.To"},
    {$group:
    {_id:
    {_id:"$_id", to:"$headers.To", from:"$headers.From"},
    To:{$addToSet:"$headers.To"}
    }}
    ],
    {allowDiskUse:true});
    ```
    Response:
    ```
    {
    "_id" : {
    "_id" : ObjectId("4f16fc97d1e2d32371003f3e"),
    "to" : "bryan.hull@enron.com",
    "from" : "eric.bass@enron.com"
    },
    "To" : [
    "bryan.hull@enron.com"
    ]
    }

    /* 9 */
    {
    "_id" : {
    "_id" : ObjectId("4f16fc97d1e2d32371003f3e"),
    "to" : "david.baumbach@enron.com",
    "from" : "eric.bass@enron.com"
    },
    "To" : [
    "david.baumbach@enron.com"
    ]
    }
    ```
    as you can see the ID is the same, but we wan to again get the array To back so as to remove the duplicates. If you remove the "$headers.To" from the composite id in the group filter, you get this:

    Query:
    ```
    db.messages.aggregate([
    {$unwind:"$headers.To"},
    {$group:
    {_id:
    {_id:"$_id", from:"$headers.From"},
    To:{$addToSet:"$headers.To"}
    }}
    ],
    {allowDiskUse:true});
    ```

    Response:

    ```
    {
    "_id" : {
    "_id" : ObjectId("4f16fe37d1e2d3237107dd50"),
    "from" : "bob.shults@enron.com"
    },
    "To" : [
    "john.sherriff@enron.com",
    "louise.kitchen@enron.com",
    "kevin.mcgowan@enron.com",
    "marcello.romano@enron.com",
    "david.gallagher@enron.com",
    "kevin.presto@enron.com",
    "john.arnold@enron.com",
    "john.jeffrey@enron.com",
    "paul.racicot@enron.com",
    "thomas.martin@enron.com",
    "john.lavorato@enron.com",
    "hunter.shively@enron.com",
    "phillip.allen@enron.com",
    "jim.fallon@enron.com",
    "scott.neal@enron.com",
    "chris.mahoney@enron.com",
    "paul.mead@enron.com",
    "dana.davis@enron.com",
    "gregor.baumerich@enron.com",
    "george.mcclellan@enron.com",
    "tim.belden@enron.com",
    "rogers.herndon@enron.com",
    "jonathan.mckay@enron.com",
    "fletcher.sturm@enron.com",
    "william.white@enron.com",
    "richard.lewis@enron.com",
    "thor.lien@enron.com"
    ]
    }
    ```
    Although you again unwind $To after this, so both the result will be the same, but this is to just point out the intention behind both the queries is different.

    ReplyDelete
  2. Hello. This one works falser and without allowDiskUse : true

    db.messages.aggregate([
    { $unwind: "$headers.To" } ,
    { $group: { _id: "$_id", "mes_to": { $addToSet: "$headers.To" }, "mes_from" : { $first: "$headers.From" } }},
    { $unwind: "$mes_to" } ,
    { $group: { _id: { from: "$mes_from", to : "$mes_to" }, count: { $sum: 1 } }},
    { $sort: { "count" : -1 } },
    { $limit: 1} ])

    ReplyDelete