How to provide a reproducible copy of your DataFrame with to_clipboard()


Problem :

2018-09-18_reproducible_dataframe.ipynb

  • This question was previously marked as a duplicate of How to make good reproducible pandas examples.
    • Go to that question if you need to make synthetic (fake) data to share.
    • The other question and associated answers cover how to create a reproducible dataframe.
    • They do not cover how to copy an existing dataframe with .to_clipboard, while this question specifically covers .to_clipboard.

  • This may seem like an obvious question. However, many of the users asking questions about Pandas are new and inexperienced.
  • A critical component of asking a question is How to create a Minimal, Complete, and Verifiable example, which explains “what” and “why”, but not “how”.

For example, as the OP, I may have the following dataframe:

  • For this example, I’ve created synthetic data, which is an option for creating a reproducible dataset, but not within the scope of this question.
  • Think of this as if you’ve loaded a file and only need to share a small part of it to reproduce the error.
import pandas as pd
import numpy as np
from datetime import datetime
from string import ascii_lowercase as al

np.random.seed(365)
rows = 15
cols = 2
data = np.random.randint(0, 10, size=(rows, cols))
index = pd.bdate_range(datetime.today(), freq='d', periods=rows)

df = pd.DataFrame(data=data, index=index, columns=list(al[:cols]))

            a  b
2020-07-30  2  4
2020-07-31  1  5
2020-08-01  2  2
2020-08-02  9  8
2020-08-03  4  0
2020-08-04  3  3
2020-08-05  7  7
2020-08-06  7  0
2020-08-07  8  4
2020-08-08  3  2
2020-08-09  6  2
2020-08-10  6  8
2020-08-11  9  6
2020-08-12  1  6
2020-08-13  5  7
  • The dataframe could be followed by other code that produces an error or doesn’t produce the desired outcome.

Things that should be provided when asking a question on Stack Overflow.

Do not add your data as an answer to this question.

Solution :

First: Do not post images of data; post text only, please.

Second: Do not paste data in the comments section or as an answer; edit your question instead.


How to quickly provide sample data from a pandas DataFrame

  • There is more than one way to answer this question. However, this answer isn’t meant as an exhaustive solution. It provides the simplest method possible.
  • For the curious, there are other more verbose solutions provided on Stack Overflow.
  1. Provide a link to a shareable dataset (maybe on GitHub or a shared file on Google). This is particularly useful if it’s a large dataset and the objective is to optimize some method. The drawback is that the data may no longer be available in the future, which reduces the benefit of the post.
    • Data must be provided in the question, but can be accompanied by a link to a more extensive dataset.
    • Do not post only a link or an image of the data.
  2. Provide the output of df.head(10).to_clipboard(sep=',', index=True)

Code:

Provide the output of pandas.DataFrame.to_clipboard

df.head(10).to_clipboard(sep=',', index=True)
  • If you have a multi-index DataFrame, add a note stating which columns are the indices.
  • Note: when the previous line of code is executed, no output will appear.
    • The result of the code is now on the clipboard.
  • Paste the clipboard into a code block in your Stack Overflow question
,a,b
2020-07-30,2,4
2020-07-31,1,5
2020-08-01,2,2
2020-08-02,9,8
2020-08-03,4,0
2020-08-04,3,3
2020-08-05,7,7
2020-08-06,7,0
2020-08-07,8,4
2020-08-08,3,2
  • Someone trying to answer your question can copy this text to the clipboard and then run:
df = pd.read_clipboard(sep=',')
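
Reading the pasted text back with only sep=',' leaves the dates in an ordinary, unnamed first column rather than in the index. The following is a minimal sketch of restoring the index, assuming the text has the comma-separated layout shown above. io.StringIO stands in for the clipboard so the example runs anywhere, and index_col and parse_dates are pandas.read_csv parameters, which pd.read_clipboard accepts as keyword arguments.

```python
import io
import pandas as pd

# Text as it would be copied from the question (first three rows only).
pasted = """,a,b
2020-07-30,2,4
2020-07-31,1,5
2020-08-01,2,2
"""

# io.StringIO stands in for the clipboard here. With real clipboard contents,
# the equivalent call would be:
#   pd.read_clipboard(sep=',', index_col=0, parse_dates=True)
restored = pd.read_csv(io.StringIO(pasted), sep=',', index_col=0, parse_dates=True)

print(restored)
print(type(restored.index).__name__)
```

Without index_col=0 the dates would still load, but as a regular column, so it is worth stating in the question whether the first column is meant to be the index.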

Locations of the dataframe other than .head(10)

  • Specify a section of the dataframe with the .iloc property
  • The following example selects rows 3 – 11 (positions are zero-based, and the stop value 12 is excluded) and all the columns
df.iloc[3:12, :].to_clipboard(sep=',')
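
To check which rows a slice will share before copying it, the comma-separated text can be inspected as a string first. This is a rough sketch, assuming a 15-row frame like the one in the question. The fixed start date and the names subset and text are illustrative choices made here, not part of the original post.

```python
import numpy as np
import pandas as pd

# Rebuild a 15-row frame like the one in the question; the fixed start date is
# an assumption made here so that the dates are the same on every run.
np.random.seed(365)
data = np.random.randint(0, 10, size=(15, 2))
index = pd.date_range('2020-07-30', periods=15, freq='D')
df = pd.DataFrame(data=data, index=index, columns=['a', 'b'])

# iloc positions are zero-based and the stop value is excluded,
# so 3:12 keeps positions 3 through 11 (9 rows).
subset = df.iloc[3:12, :]

# to_csv() with no path returns the comma-separated text as a string, which
# makes it easy to inspect before running subset.to_clipboard(sep=',').
text = subset.to_csv(sep=',')
print(text)
```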

Additional References for pd.read_clipboard

Google Colab Users

  • .to_clipboard() won’t work
  • Use .to_dict() to copy your dataframe
# if you have a datetime column, convert it to a str
df['date'] = df['date'].astype('str')

# if you have a datetime index, convert it to a str
df.index = df.index.astype('str')

# output to a dict
df.head(10).to_dict(orient='index')

# which will look like
{'2020-07-30': {'a': 2, 'b': 4},
 '2020-07-31': {'a': 1, 'b': 5},
 '2020-08-01': {'a': 2, 'b': 2},
 '2020-08-02': {'a': 9, 'b': 8},
 '2020-08-03': {'a': 4, 'b': 0},
 '2020-08-04': {'a': 3, 'b': 3},
 '2020-08-05': {'a': 7, 'b': 7},
 '2020-08-06': {'a': 7, 'b': 0},
 '2020-08-07': {'a': 8, 'b': 4},
 '2020-08-08': {'a': 3, 'b': 2}}

# copy the previous dict and paste into a code block on SO
# the dict can be converted to a dataframe with 
# df = pd.DataFrame.from_dict(d, orient='index')  # d is the name of the dict
# convert the datetime column or index back to datetime
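
The return trip described in the comments above can be sketched as follows, assuming the dict was pasted with string dates as keys. The name d is only a placeholder for the pasted dict.

```python
import pandas as pd

# Dict as pasted from the question (first three rows only);
# d is a placeholder name for illustration.
d = {'2020-07-30': {'a': 2, 'b': 4},
     '2020-07-31': {'a': 1, 'b': 5},
     '2020-08-01': {'a': 2, 'b': 2}}

# Outer keys become the index, inner keys become the columns.
df = pd.DataFrame.from_dict(d, orient='index')

# The index arrives as strings, so convert it back to datetime.
df.index = pd.to_datetime(df.index)

print(df)
print(type(df.index).__name__)
```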

If you do something like print(df.head(20)) and paste the output in code format, then we can use pd.read_clipboard() to load the data into a dataframe. This approach works for the vast majority of questions posted under the pandas tag, but it fails for questions involving a MultiIndex.
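
As a rough illustration of why printed output can be loaded this way: pd.read_clipboard() splits on whitespace by default (sep='\s+') and hands the text to the pandas.read_csv parser. The sketch below assumes a simple single-index frame and uses io.StringIO in place of the clipboard so it runs anywhere.

```python
import io
import pandas as pd

# Printed output of a small frame, as it would be copied from a question.
printed = """            a  b
2020-07-30  2  4
2020-07-31  1  5
2020-08-01  2  2
"""

# The header has two names but each row has three fields, so the parser
# uses the first field of each row as the index.
df = pd.read_csv(io.StringIO(printed), sep=r'\s+')

print(df)
```

The index is read back as plain strings here, so a datetime index still needs pd.to_datetime afterwards.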
