Pandas: How to easily share a sample dataframe using df.to_dict()?




This question was earlier marked as a duplicate of How to make good reproducible pandas examples. That contribution should undoubtedly be the go-to post for anyone seeking to make a reproducible data sample. This post is meant to clarify a very practical and efficient way to include a given data sample in a question using df.to_dict() in combination with df = pd.DataFrame(<dict>). This was not explicitly covered in either the question or the answers in How to make good reproducible pandas examples. Using df.to_dict() also works very well in tandem with df.to_clipboard(), concisely covered in the post How to provide a reproducible copy of your DataFrame with to_clipboard().


Despite the clear and concise guidance on How do I ask a good question? and How to create a Minimal, Reproducible Example, many people still neglect to include a reproducible data sample in their question. So what is a practical and easy way to reproduce a data sample when a simple pd.DataFrame(np.random.random(size=(5, 5))) is not enough? How can you, for example, use df.to_dict() and include the output in a question?

Asked By: vestland


Answer #1:

The answer:

In many situations, using an approach with df.to_dict() will do the job perfectly! Here are two cases that come to mind:

Case 1: You’ve got a dataframe built or loaded in Python from a local source

Case 2: You’ve got a table in another application (like Excel)


The details:

Case 1: You’ve got a dataframe built or loaded from a local source

Given that you’ve got a pandas dataframe named df, just

  1. run df.to_dict() in your console or editor, and
  2. copy the output that is formatted as a dictionary, and
  3. paste the content into pd.DataFrame(<output>) and include that chunk in your now reproducible code snippet.
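The three steps above can be sketched as a small round trip. The dataframe here is a hypothetical stand-in for your own data, and the dictionary in step 3 is simply the printed output of step 1 pasted back in:

```python
import pandas as pd

# Hypothetical stand-in for the dataframe you want to share.
df = pd.DataFrame({"city": ["Oslo", "Bergen", "Bodo"],
                   "temp_c": [4.5, 7.0, -2.0]})

# Steps 1 and 2: print the dictionary and copy it from the console.
as_dict = df.to_dict()
print(as_dict)

# Step 3: paste the copied dictionary into pd.DataFrame(...).
rebuilt = pd.DataFrame({"city": {0: "Oslo", 1: "Bergen", 2: "Bodo"},
                        "temp_c": {0: 4.5, 1: 7.0, 2: -2.0}})

# The pasted version carries the same labels and values as the original.
print(rebuilt.to_dict() == as_dict)
```

Anyone who copies the pd.DataFrame(...) call gets the same index, column names, and values without needing access to your files.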

Case 2: You’ve got a table in another application (like Excel)

Depending on the source and its separator (for example ',', ';' or '\s+', where the last one matches any run of whitespace), you can simply:

  1. copy the contents with Ctrl+C,
  2. run df = pd.read_clipboard(sep=r'\s+') in your console or editor (adjust sep to match your source),
  3. run df.to_dict(), and
  4. include the output in df = pd.DataFrame(<output>)
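If you want to try the parsing step without touching the clipboard, here is a minimal sketch. It relies on the fact that pd.read_clipboard passes the clipboard text to pd.read_csv, so calling pd.read_csv on an in-memory buffer behaves the same way. The copied_text string is a made-up example of what copied spreadsheet cells might look like:

```python
import io

import pandas as pd

# Hypothetical text, standing in for cells copied from a spreadsheet.
copied_text = """name  qty  price
apple  3  1.25
pear   5  0.5
"""

# Clipboard-free equivalent of pd.read_clipboard(sep=r'\s+'):
# the regex separator treats any run of whitespace as one delimiter.
df = pd.read_csv(io.StringIO(copied_text), sep=r"\s+")

# This printed dictionary is what you would paste into your question.
as_dict = df.to_dict()
print(as_dict)

# Readers of the question rebuild the frame from the pasted dictionary.
rebuilt = pd.DataFrame(as_dict)
```

Note that a whitespace separator splits cells that themselves contain spaces, so a different sep may be needed for such data.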

In this case, the start of your question would look something like this:

import pandas as pd
df = pd.DataFrame({0: {0: 0.25474768796402636, 1: 0.5792136563952824, 2: 0.5950396800676201},
                   1: {0: 0.9071073567355232, 1: 0.1657288354283053, 2: 0.4962367707789421},
                   2: {0: 0.7440601352930207, 1: 0.7755487356392468, 2: 0.5230707257648775}})

Of course, this gets a little clumsy with larger dataframes. But very often, all that anyone who seeks to answer your question needs is a little sample of your real world data to take the structure of your data into consideration.

And there are two ways you can handle larger dataframes:

  1. run df.head(20).to_dict() to only include the first 20 rows, and
  2. change the format of your dict using, for example, df.to_dict('split') (there are other options besides 'split') to reshape your output to a dict that requires fewer lines.
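As a rough illustration of those other options, the sketch below compares a few orient values on a small made-up dataframe. Which one reads best in a question is a judgment call:

```python
import pandas as pd

# Small hypothetical dataframe used only to compare output shapes.
df = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})

by_column = df.to_dict()                      # {'a': {0: 1, 1: 2, 2: 3}, 'b': {0: 'x', ...}}
as_lists = df.to_dict(orient="list")          # {'a': [1, 2, 3], 'b': ['x', 'y', 'z']}
as_records = df.to_dict(orient="records")     # [{'a': 1, 'b': 'x'}, {'a': 2, 'b': 'y'}, ...]
as_split = df.to_dict(orient="split")         # {'index': [...], 'columns': [...], 'data': [...]}

# 'list' and 'records' output can be passed straight back to pd.DataFrame.
# Both drop the index, so they suit frames with a default 0..n-1 index.
from_lists = pd.DataFrame(as_lists)
from_records = pd.DataFrame(as_records)

# Limiting the rows first keeps any of these formats short.
first_two = df.head(2).to_dict(orient="list")
print(first_two)
```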

Here’s an example using the iris dataset, which is available from plotly express, among other places.

If you just run:

import plotly.express as px
import pandas as pd
df = px.data.iris()
df.to_dict()

This will produce an output of nearly 1000 lines, which won’t be very practical as a reproducible sample. But if you include .head(10), as in df.head(10).to_dict(), you’ll get:

{'sepal_length': {0: 5.1, 1: 4.9, 2: 4.7, 3: 4.6, 4: 5.0, 5: 5.4, 6: 4.6, 7: 5.0, 8: 4.4, 9: 4.9},
 'sepal_width': {0: 3.5, 1: 3.0, 2: 3.2, 3: 3.1, 4: 3.6, 5: 3.9, 6: 3.4, 7: 3.4, 8: 2.9, 9: 3.1},
 'petal_length': {0: 1.4, 1: 1.4, 2: 1.3, 3: 1.5, 4: 1.4, 5: 1.7, 6: 1.4, 7: 1.5, 8: 1.4, 9: 1.5},
 'petal_width': {0: 0.2, 1: 0.2, 2: 0.2, 3: 0.2, 4: 0.2, 5: 0.4, 6: 0.3, 7: 0.2, 8: 0.2, 9: 0.1},
 'species': {0: 'setosa', 1: 'setosa', 2: 'setosa', 3: 'setosa', 4: 'setosa', 5: 'setosa', 6: 'setosa', 7: 'setosa', 8: 'setosa', 9: 'setosa'},
 'species_id': {0: 1, 1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 1, 9: 1}}

And now we’re getting somewhere. Depending on the structure and content of the data, though, this may not cover the complexity of the contents in a satisfactory manner. You can include more data on fewer lines by using to_dict('split') like this:

import plotly.express as px
df = px.data.iris().head(10)
df.to_dict('split')

Now your output will look like:

{'index': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
 'columns': ['sepal_length',
  'sepal_width',
  'petal_length',
  'petal_width',
  'species',
  'species_id'],
 'data': [[5.1, 3.5, 1.4, 0.2, 'setosa', 1],
  [4.9, 3.0, 1.4, 0.2, 'setosa', 1],
  [4.7, 3.2, 1.3, 0.2, 'setosa', 1],
  [4.6, 3.1, 1.5, 0.2, 'setosa', 1],
  [5.0, 3.6, 1.4, 0.2, 'setosa', 1],
  [5.4, 3.9, 1.7, 0.4, 'setosa', 1],
  [4.6, 3.4, 1.4, 0.3, 'setosa', 1],
  [5.0, 3.4, 1.5, 0.2, 'setosa', 1],
  [4.4, 2.9, 1.4, 0.2, 'setosa', 1],
  [4.9, 3.1, 1.5, 0.1, 'setosa', 1]]}

And now you can easily increase the number in .head(10) without cluttering your question too much. There’s one minor drawback: you can no longer pass the output directly to pd.DataFrame. But if you pass the index, columns, and data entries explicitly, you’ll be just fine. So for this particular dataset, my preferred approach would be:

import pandas as pd
# Output of to_dict('split') for the first 15 rows of the iris dataset
sample = {'index': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14],
             'columns': ['sepal_length',
              'sepal_width',
              'petal_length',
              'petal_width',
              'species',
              'species_id'],
             'data': [[5.1, 3.5, 1.4, 0.2, 'setosa', 1],
              [4.9, 3.0, 1.4, 0.2, 'setosa', 1],
              [4.7, 3.2, 1.3, 0.2, 'setosa', 1],
              [4.6, 3.1, 1.5, 0.2, 'setosa', 1],
              [5.0, 3.6, 1.4, 0.2, 'setosa', 1],
              [5.4, 3.9, 1.7, 0.4, 'setosa', 1],
              [4.6, 3.4, 1.4, 0.3, 'setosa', 1],
              [5.0, 3.4, 1.5, 0.2, 'setosa', 1],
              [4.4, 2.9, 1.4, 0.2, 'setosa', 1],
              [4.9, 3.1, 1.5, 0.1, 'setosa', 1],
              [5.4, 3.7, 1.5, 0.2, 'setosa', 1],
              [4.8, 3.4, 1.6, 0.2, 'setosa', 1],
              [4.8, 3.0, 1.4, 0.1, 'setosa', 1],
              [4.3, 3.0, 1.1, 0.1, 'setosa', 1],
              [5.8, 4.0, 1.2, 0.2, 'setosa', 1]]}
df = pd.DataFrame(index=sample['index'], columns=sample['columns'], data=sample['data'])
df

Now you’ll have this dataframe to work with:

    sepal_length  sepal_width  petal_length  petal_width species  species_id
0            5.1          3.5           1.4          0.2  setosa           1
1            4.9          3.0           1.4          0.2  setosa           1
2            4.7          3.2           1.3          0.2  setosa           1
3            4.6          3.1           1.5          0.2  setosa           1
4            5.0          3.6           1.4          0.2  setosa           1
5            5.4          3.9           1.7          0.4  setosa           1
6            4.6          3.4           1.4          0.3  setosa           1
7            5.0          3.4           1.5          0.2  setosa           1
8            4.4          2.9           1.4          0.2  setosa           1
9            4.9          3.1           1.5          0.1  setosa           1
10           5.4          3.7           1.5          0.2  setosa           1
11           4.8          3.4           1.6          0.2  setosa           1
12           4.8          3.0           1.4          0.1  setosa           1
13           4.3          3.0           1.1          0.1  setosa           1
14           5.8          4.0           1.2          0.2  setosa           1

And that will significantly increase your chances of receiving useful answers!
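As a shorter alternative to spelling out the three keyword arguments, the 'split' dictionary can also be unpacked directly. This is a sketch that relies on the keys 'index', 'columns', and 'data' matching parameter names of the pd.DataFrame constructor. The sample below is a trimmed, hypothetical version of the iris data:

```python
import pandas as pd

# Trimmed sample in the shape produced by df.to_dict('split').
sample = {"index": [0, 1, 2],
          "columns": ["sepal_length", "species"],
          "data": [[5.1, "setosa"], [4.9, "setosa"], [4.7, "setosa"]]}

# Same result as pd.DataFrame(index=sample['index'],
#                             columns=sample['columns'],
#                             data=sample['data'])
df = pd.DataFrame(**sample)
print(df)
```

The explicit form is easier to read for someone new to pandas, while the unpacked form saves a line when the dictionary is pasted as is.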

Edit:

When the dataframe holds timestamps, the output of df.to_dict() contains entries like 1: Timestamp('2020-01-02 00:00:00'), and pd.DataFrame(<output>) will not be able to read them unless you also include from pandas import Timestamp.
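The timestamp caveat can be reproduced with a small made-up dataframe. In this sketch, eval() is used only to simulate running the pasted text, first without and then with the Timestamp name available:

```python
import pandas as pd

# Hypothetical dataframe with a datetime column.
df = pd.DataFrame({"date": pd.to_datetime(["2020-01-01", "2020-01-02"]),
                   "value": [1, 2]})

# repr() gives the text you would see in the console and paste into a question.
pasted = repr(df.to_dict())
print(pasted)

# Simulate running the pasted text without importing Timestamp.
try:
    eval(pasted, {})
    needs_import = False
except NameError:
    needs_import = True
print(needs_import)

# Simulate running it with `from pandas import Timestamp` in place.
rebuilt = pd.DataFrame(eval(pasted, {"Timestamp": pd.Timestamp}))
print(rebuilt)
```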

Answered By: vestland
The answers/resolutions are collected from Stack Overflow and are licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
