I am trying to read a CSV file from a Google Cloud Storage bucket into a pandas DataFrame.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
from io import BytesIO
from google.cloud import storage

storage_client = storage.Client()
bucket = storage_client.get_bucket('createbucket123')
blob = bucket.blob('my.csv')
path = "gs://createbucket123/my.csv"
df = pd.read_csv(path)
It shows this error message:
FileNotFoundError: File b'gs://createbucket123/my.csv' does not exist
What am I doing wrong? I am not able to find any solution that does not involve Google Datalab.
As of version 0.24 of pandas, read_csv supports reading directly from Google Cloud Storage. Simply provide the gs:// path to the file like this:
df = pd.read_csv('gs://bucket/your_path.csv')
read_csv will then use the gcsfs module to read the DataFrame, which means it has to be installed (or you will get an exception pointing at the missing dependency).
I leave three other options for the sake of completeness:
- Home-made code
- gcsfs
- Dask
I will cover them below.
The hard way: do-it-yourself code
I have written some convenience functions to read from Google Storage. To make it more readable I added type annotations. If you happen to be on Python 2, simply remove these and the code will work all the same.
It works equally on public and private data sets, assuming you are authorised. With this approach you don't need to download the data to your local drive first.
How to use it:
fileobj = get_byte_fileobj('my-project', 'my-bucket', 'my-path')
df = pd.read_csv(fileobj)
from io import BytesIO, StringIO
from google.cloud import storage
from google.oauth2 import service_account


def get_byte_fileobj(project: str,
                     bucket: str,
                     path: str,
                     service_account_credentials_path: str = None) -> BytesIO:
    """
    Retrieve data from a given blob on Google Storage and pass it as a file object.
    :param path: path within the bucket
    :param project: name of the project
    :param bucket: name of the bucket
    :param service_account_credentials_path: path to credentials.
           TIP: can be stored as env variable, e.g. os.getenv('GOOGLE_APPLICATION_CREDENTIALS_DSPLATFORM')
    :return: file object (BytesIO)
    """
    blob = _get_blob(bucket, path, project, service_account_credentials_path)
    byte_stream = BytesIO()
    blob.download_to_file(byte_stream)
    byte_stream.seek(0)
    return byte_stream


def get_bytestring(project: str,
                   bucket: str,
                   path: str,
                   service_account_credentials_path: str = None) -> bytes:
    """
    Retrieve data from a given blob on Google Storage and pass it as a byte-string.
    :param path: path within the bucket
    :param project: name of the project
    :param bucket: name of the bucket
    :param service_account_credentials_path: path to credentials.
           TIP: can be stored as env variable, e.g. os.getenv('GOOGLE_APPLICATION_CREDENTIALS_DSPLATFORM')
    :return: byte-string (needs to be decoded)
    """
    blob = _get_blob(bucket, path, project, service_account_credentials_path)
    s = blob.download_as_string()
    return s


def _get_blob(bucket_name, path, project, service_account_credentials_path):
    credentials = service_account.Credentials.from_service_account_file(
        service_account_credentials_path) if service_account_credentials_path else None
    storage_client = storage.Client(project=project, credentials=credentials)
    bucket = storage_client.get_bucket(bucket_name)
    blob = bucket.blob(path)
    return blob
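For completeness, here is a rough sketch of using get_bytestring in the same way (the project, bucket, and path names are the same placeholders as above, and UTF-8 encoding of the CSV is assumed):

import pandas as pd
from io import StringIO

# read the whole blob as a byte-string, decode it, then parse with pandas
csv_bytes = get_bytestring('my-project', 'my-bucket', 'my-path')
df = pd.read_csv(StringIO(csv_bytes.decode('utf-8')))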
gcsfs is a “Pythonic file-system for Google Cloud Storage”.
How to use it:
import pandas as pd
import gcsfs

fs = gcsfs.GCSFileSystem(project='my-project')
with fs.open('bucket/path.csv') as f:
    df = pd.read_csv(f)
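If you cannot rely on default application credentials, gcsfs also accepts a token argument. A minimal sketch, assuming a service-account key file (the key path here is a placeholder):

import pandas as pd
import gcsfs

# token can point to a service-account JSON key file (placeholder path)
fs = gcsfs.GCSFileSystem(project='my-project', token='/path/to/service-account.json')
with fs.open('bucket/path.csv') as f:
    df = pd.read_csv(f)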
Dask “provides advanced parallelism for analytics, enabling performance at scale for the tools you love”. It’s great when you need to deal with large volumes of data in Python. Dask tries to mimic much of the
pandas API, making it easy to use for newcomers.
Dask provides its own read_csv, which accepts gs:// paths and even wildcards.
How to use it:
import dask.dataframe as dd

df = dd.read_csv('gs://bucket/data.csv')
df2 = dd.read_csv('gs://bucket/path/*.csv')  # nice!

# df is now a Dask dataframe, ready for distributed processing.
# If you want the pandas version, simply:
df_pd = df.compute()
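To illustrate why this is useful, here is a rough sketch of planning an aggregation lazily before pulling the result back into pandas (the column name is hypothetical):

import dask.dataframe as dd

df = dd.read_csv('gs://bucket/path/*.csv')
# nothing is read yet; the groupby/mean below is only planned
means = df.groupby('some_column').mean()
# compute() triggers the actual read and computation and returns a pandas object
result = means.compute()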
Another option is to use TensorFlow which comes with the ability to do a streaming read from Google Cloud Storage:
import pandas as pd
from tensorflow.python.lib.io import file_io

with file_io.FileIO('gs://bucket/file.csv', 'r') as f:
    df = pd.read_csv(f)
Using tensorflow also gives you a convenient way to handle wildcards in the filename. For example:
Reading wildcard CSV into Pandas
Here is code that will read all CSVs that match a specific pattern (e.g. gs://bucket/some/dir/train-*) into a Pandas dataframe:
import tensorflow as tf
from tensorflow.python.lib.io import file_io
import pandas as pd

def read_csv_file(filename):
    with file_io.FileIO(filename, 'r') as f:
        df = pd.read_csv(f, header=None, names=['col1', 'col2'])
    return df

def read_csv_files(filename_pattern):
    filenames = tf.gfile.Glob(filename_pattern)
    dataframes = [read_csv_file(filename) for filename in filenames]
    return pd.concat(dataframes)
import os

DATADIR = 'gs://my-bucket/some/dir'
traindf = read_csv_files(os.path.join(DATADIR, 'train-*'))
evaldf = read_csv_files(os.path.join(DATADIR, 'eval-*'))
As of pandas==0.24.0 this is supported natively if you have gcsfs installed: https://github.com/pandas-dev/pandas/pull/22704.
Until the official release you can try it out with pip install pandas==0.24.0rc1.
read_csv does not support gs:// URLs.
From the documentation:
The string could be a URL. Valid URL schemes include http, ftp, s3, and file. For file URLs, a host is expected. For instance, a local file could be file://localhost/path/to/table.csv
There are three ways of accessing files in GCS:
- Downloading the client library (this one for you)
- Using Cloud Storage Browser in the Google Cloud Platform Console
- Using gsutil, a command-line tool for working with files in Cloud Storage.
Using Step 1, set up GCS for your work. After that, you have to:
import cloudstorage as gcs
from google.appengine.api import app_identity
Then you have to specify the Cloud Storage bucket name and create read/write functions to access your bucket:
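As a rough sketch of what such a read function could look like with the cloudstorage client library (the bucket and object names are placeholders, and this assumes you are running on App Engine):

import cloudstorage as gcs
from google.appengine.api import app_identity
import pandas as pd

# use the app's default bucket, or replace with your own bucket name
bucket_name = app_identity.get_default_gcs_bucket_name()

def read_csv_from_gcs(object_name):
    # cloudstorage expects paths of the form /bucket/object
    gcs_path = '/{}/{}'.format(bucket_name, object_name)
    with gcs.open(gcs_path, 'r') as gcs_file:
        return pd.read_csv(gcs_file)

df = read_csv_from_gcs('my.csv')  # placeholder object name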
You can find the remaining read/write tutorial here:
If I understood your question correctly, then maybe this link can help you get a better URL for your read_csv() function:
One will still need the gcsfs package if loading compressed files.
Tried pd.read_csv('gs://your-bucket/path/data.csv.gz') with pandas version 0.25.3 and got the following error:
/opt/conda/anaconda/lib/python3.6/site-packages/pandas/io/parsers.py in _read(filepath_or_buffer, kwds)
    438     # See https://github.com/python/mypy/issues/1297
    439     fp_or_buf, _, compression, should_close = get_filepath_or_buffer(
--> 440         filepath_or_buffer, encoding, compression
    441     )
    442     kwds["compression"] = compression

/opt/conda/anaconda/lib/python3.6/site-packages/pandas/io/common.py in get_filepath_or_buffer(filepath_or_buffer, encoding, compression, mode)
    211
    212     if is_gcs_url(filepath_or_buffer):
--> 213         from pandas.io import gcs
    214
    215         return gcs.get_filepath_or_buffer(

/opt/conda/anaconda/lib/python3.6/site-packages/pandas/io/gcs.py in <module>
      3
      4 gcsfs = import_optional_dependency(
----> 5     "gcsfs", extra="The gcsfs library is required to handle GCS files"
      6 )
      7

/opt/conda/anaconda/lib/python3.6/site-packages/pandas/compat/_optional.py in import_optional_dependency(name, extra, raise_on_missing, on_version)
     91     except ImportError:
     92         if raise_on_missing:
---> 93             raise ImportError(message.format(name=name, extra=extra)) from None
     94         else:
     95             return None

ImportError: Missing optional dependency 'gcsfs'. The gcsfs library is required to handle GCS files Use pip or conda to install gcsfs.
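Based on the traceback, installing gcsfs (for example with pip install gcsfs) should resolve this; a minimal sketch, assuming the same placeholder path:

import pandas as pd

# with gcsfs installed, pandas handles the gs:// URL itself and
# infers gzip compression from the .gz extension
df = pd.read_csv('gs://your-bucket/path/data.csv.gz')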