MongoDB is one of the leading NoSQL databases, and its aggregation framework enables powerful queries as well as data operations. In this post we will see how to save the results of aggregation pipelines into a Pandas DataFrame.

I already provided an introduction to MongoDB and Compass in a previous post of my MongoDB series, so you should know that Compass is the way to go for creating aggregation pipelines.

Now, the main reason to get data out of MongoDB is to perform some task with it, and few tools are better suited than Pandas to further transformation or analysis. So, how do you get the data you queried into a Pandas DataFrame?

For all subsequent examples I will assume you have already initialized your environment this way:

    import pandas as pd
    from pymongo import MongoClient

    client = MongoClient("mongo_connection_string")
    result = client.db.col.aggregate(YOUR_AGG_PIPELINE)

Where mongo_connection_string is the URI of your MongoDB instance, db is the name of your database, col is the name of your collection, and YOUR_AGG_PIPELINE is your aggregation pipeline.

One very simple way to do it is to cast the cursor resulting from your aggregation query to a list, and feed it to the Pandas DataFrame constructor:

    data_df = pd.DataFrame(list(result))

However, this creates an in-memory object whose size is directly proportional to the amount of data you retrieved. You can be a little more parsimonious by using an iterator rather than a list:

    data_df = pd.DataFrame(iter(result))

Getting some balance

The second solution above still makes you load all the data in memory in order to create your DataFrame. A slightly different solution can be implemented with a helper function:

    def iterator2dataframes(iterator, chunk_size: int):
        """Turn an iterator into multiple small pandas.DataFrame

        This is a balance between memory and efficiency
        """

And then running:

    data_df = iterator2dataframes(result, 20000)

The PyMongoArrow way

If you look at the official MongoDB documentation, it suggests using the new support library, PyMongoArrow, to get data from MongoDB into Pandas, NumPy and Apache Arrow. Let's see how it works!

First, you'll need to install the library with:

    python -m pip install pymongoarrow

Then you'll need to import the package and call the patch_all() function:

    from pymongoarrow.monkey import patch_all
    patch_all()

This function extends PyMongo with the PyMongoArrow methods.
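The iterator2dataframes helper from the "Getting some balance" section appears here only as a signature and docstring. As a minimal sketch of what such a helper could look like (the body below is my assumption, not necessarily the original implementation), you can read the cursor in chunks of chunk_size documents, turn each chunk into a small DataFrame, and concatenate them at the end. The fake_cursor generator is a hypothetical stand-in for a real aggregation cursor, so the example runs without a MongoDB instance.

```python
from itertools import islice

import pandas as pd


def iterator2dataframes(iterator, chunk_size: int) -> pd.DataFrame:
    """Turn an iterator of documents into one DataFrame, built chunk by chunk.

    Assumed body: only the name, signature and purpose come from the post.
    """
    if chunk_size < 1:
        raise ValueError("chunk_size must be a positive integer")

    iterator = iter(iterator)
    frames = []
    while True:
        # Pull at most chunk_size documents from the cursor.
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            break
        # Each chunk becomes a small DataFrame; the list of dicts can then be freed.
        frames.append(pd.DataFrame(chunk))

    if not frames:
        # Empty cursor: return an empty DataFrame instead of failing in concat.
        return pd.DataFrame()
    return pd.concat(frames, ignore_index=True)


# Hypothetical stand-in for the cursor returned by an aggregation query.
fake_cursor = ({"_id": i, "value": i * 2} for i in range(5))
data_df = iterator2dataframes(fake_cursor, chunk_size=2)
```

With this approach only chunk_size raw documents are held as Python dictionaries at any time. The final DataFrame still contains all the data, but in the more compact columnar form.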