Documents
Methods for drawing down, editing and uploading data about documents.
DocumentClient
- class documentcloud.documents.DocumentClient
The document client gives access to retrieval and uploading of documents. It is generally accessed as client.documents.
- get(id_, expand=None)
Return the document with the provided DocumentCloud identifier.
>>> from documentcloud import DocumentCloud
>>> client = DocumentCloud(USERNAME, PASSWORD)
>>> client.documents.get(71072)
<Document: Final OIR Report>
The identifier may be just the numeric ID (71072, preferred), the old style ID-slug (71072-oir-final-report), or the new style slug-id (oir-final-report-71072).
Setting expand allows for the user or organization details to be fetched with the document, instead of requiring a separate API request to fetch them. Set expand to a list of the values you would like expanded.
>>> client.documents.get(71072, expand=["user"])
>>> client.documents.get(71072, expand=["user", "organization"])
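The three identifier forms accepted by get() can be told apart with a few string checks. The helper below is only an illustrative sketch: extract_numeric_id is a hypothetical name, not part of the library, and the library's own parsing may differ.

```python
import re


def extract_numeric_id(identifier):
    """Return the numeric document ID from any of the three forms above.

    Hypothetical helper for illustration; not the library's implementation.
    """
    if isinstance(identifier, int):
        return identifier
    text = str(identifier).strip()
    # Plain numeric ID, e.g. "71072"
    if re.fullmatch(r"[0-9]+", text):
        return int(text)
    # Old style ID-slug, e.g. "71072-oir-final-report"
    old_style = re.match(r"([0-9]+)-", text)
    if old_style:
        return int(old_style.group(1))
    # New style slug-id, e.g. "oir-final-report-71072"
    new_style = re.search(r"-([0-9]+)$", text)
    if new_style:
        return int(new_style.group(1))
    raise ValueError(f"No numeric ID found in {identifier!r}")
```

Note that a new style slug that itself begins with digits would be read as an old style ID by this sketch, which is one reason the plain numeric ID is the preferred form.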
- list(**params)
Return a list of all documents, possibly filtered by the given parameters. Please see the full API documentation for available parameters.
- search(query, **params)
Return a list of documents that match the provided query.
>>> from documentcloud import DocumentCloud
>>> client = DocumentCloud()
>>> obj_list = client.documents.search('Ruben Salazar')
>>> obj_list[0]
<Document: Final OIR Report>
The params may be set to any parameters that the search endpoint takes. Please see the full search documentation for query syntax and available parameters.
- upload(pdf, **kwargs)
Upload a PDF to DocumentCloud. You must be authorized to do this. Returns the object representing the new record you’ve created. You can submit either a file path or a file object.
>>> from documentcloud import DocumentCloud
>>> client = DocumentCloud(USERNAME, PASSWORD)
>>> new_doc = client.documents.upload("/home/ben/test.pdf", title="Test PDF")
>>> # Now fetch it
>>> client.documents.get(new_doc.id)
<Document: Test PDF>
You can also use URLs which link to PDFs, if that’s the kind of thing you want to do.
>>> client.documents.upload("http://ord.legistar.com/Chicago/attachments/e3a0cbcb-044d-4ec3-9848-23c5692b1943.pdf")
You may set the kwargs to any of the writable attributes as described in documentcloud.documents.Document.put(). Additionally, you may set force_ocr in order to force OCR to take place even if the document has embedded text. You may specify which OCR engine to use by setting ocr_engine to either tess4 for Tesseract or textract for Amazon Textract. Note that Amazon Textract uses AI Credits and requires a DocumentCloud Premium account. You may set project to the ID of a project to upload the document into, or projects, a list of project IDs to upload the document into. If you are uploading a non-PDF document type, you must set original_extension to the extension of the file type, such as docx or jpg.
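As a rough sketch of how these keyword arguments combine, the helper below forces OCR with Tesseract and sets original_extension only for non-PDF files. upload_scanned_file and project_id are hypothetical names introduced for illustration; the client is expected to be an authenticated DocumentCloud client as in the examples above.

```python
import os


def upload_scanned_file(client, path, project_id=None):
    """Upload a file and force OCR with Tesseract (hypothetical helper)."""
    kwargs = {"force_ocr": True, "ocr_engine": "tess4"}
    if project_id is not None:
        # Upload straight into a single project
        kwargs["project"] = project_id
    extension = os.path.splitext(path)[1].lstrip(".").lower()
    if extension and extension != "pdf":
        # Non-PDF uploads must declare their file type
        kwargs["original_extension"] = extension
    return client.documents.upload(path, **kwargs)
```

Swapping "tess4" for "textract" would select Amazon Textract instead, subject to the account requirements noted above.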
- upload_directory(path, handle_errors=False, extensions='.pdf', **kwargs)
Searches through the provided path and attempts to upload all the PDFs it can find. Metadata provided as keyword arguments, which accept the same keywords as upload(), will be recorded for all uploads (except for title, which will be set based on the filename). Returns a list of the document objects that are created. Be warned, this will also upload any documents in directories inside the path you specify.
The handle_errors parameter will catch errors generated by the network request or the DocumentCloud API, print and log them, and try to continue processing. This might be useful if you are uploading a very large directory and do not want temporary network problems to stop the entire upload.
By default, extensions is set to “.pdf”, so it will only upload PDFs in the specified directory. You can specify a different extension, a list of extensions, or None. If None is explicitly specified, it will upload any documents that are supported by DocumentCloud in the specified directory. If you pass a file extension that is not supported by DocumentCloud, a ValueError will be raised telling you which extension is not supported.
- The following will upload all PDFs in the groucho_marx directory:
>>> from documentcloud import DocumentCloud
>>> client = DocumentCloud(DOCUMENTCLOUD_USERNAME, DOCUMENTCLOUD_PASSWORD)
>>> obj_list = client.documents.upload_directory('/home/ben/pdfs/groucho_marx/')
- The following will upload all .txt and .jpg files in the groucho_marx directory:
>>> obj_list = client.documents.upload_directory('/home/ben/pdfs/groucho_marx/', extensions=['.txt', '.jpg'])
- The following will upload all files that are supported by DocumentCloud in the groucho_marx directory:
>>> obj_list = client.documents.upload_directory('/home/ben/pdfs/groucho_marx/', extensions=None)
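Because upload_directory descends into subdirectories, it can help to preview which files a given extensions value would pick up before uploading anything. The function below is a local approximation only: find_candidate_files is a hypothetical helper, the case-insensitive matching is an assumption, and with extensions=None it lists every file rather than only the types DocumentCloud supports.

```python
import os


def find_candidate_files(path, extensions=".pdf"):
    """List files under path, including subdirectories, matching extensions.

    Hypothetical dry-run helper; it does not contact DocumentCloud.
    """
    if isinstance(extensions, str):
        extensions = [extensions]
    # None means "do not filter by extension" in this sketch
    wanted = None if extensions is None else {ext.lower() for ext in extensions}
    matches = []
    for root, _dirs, files in os.walk(path):
        for name in files:
            ext = os.path.splitext(name)[1].lower()
            if wanted is None or ext in wanted:
                matches.append(os.path.join(root, name))
    return sorted(matches)
```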
- upload_urls(url_list, handle_errors=False, **kwargs)
Given a list of URLs, attempts to upload them in batches of 25 at a time.
>>> urls = ["https://www.chicago.gov/content/dam/city/depts/dcd/tif/22reports/T_072_24thMichiganAR22.pdf", "https://www.chicago.gov/content/dam/city/depts/dcd/tif/22reports/T_063_CanalCongressAR22.pdf"]
>>> new = client.documents.upload_urls(urls)
>>> new
[<Document: 23932356 - T_072_24thMichiganAR22>, <Document: 23932357 - T_063_CanalCongressAR22>]
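To make the batching concrete, here is a minimal sketch of splitting a URL list into groups of 25. chunk_urls is a hypothetical helper written for illustration; the library does this internally and its actual code may look different.

```python
def chunk_urls(url_list, size=25):
    """Yield consecutive slices of url_list, each at most size items long."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(url_list), size):
        yield url_list[start:start + size]
```

With 60 URLs this would produce three batches of 25, 25 and 10 items.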
Document
- class documentcloud.documents.Document
An individual document, as obtained by the documentcloud.documents.DocumentClient.
- put()
Save changes to a document back to DocumentCloud. You must be authorized to make these changes. Only the access, data, description, language, related_article, published_url, source, and title attributes may be edited.
>>> # Grab a document
>>> obj = client.documents.get('71072')
>>> print(obj.title)
Draft OIR Report
>>> # Change its title
>>> obj.title = "Brand new title"
>>> print(obj.title)
Brand new title
>>> # Save those changes
>>> obj.put()
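When changing several attributes at once, a small wrapper can guard against setting something that put() will not save. This is a sketch under the assumption that the editable attributes are exactly those listed above; update_document and EDITABLE_ATTRIBUTES are hypothetical names, not part of the library.

```python
# Attribute names taken from the list of editable attributes above
EDITABLE_ATTRIBUTES = {
    "access", "data", "description", "language",
    "related_article", "published_url", "source", "title",
}


def update_document(doc, **changes):
    """Set editable attributes on doc and save them with a single put()."""
    unknown = set(changes) - EDITABLE_ATTRIBUTES
    if unknown:
        raise ValueError(f"Not editable: {sorted(unknown)}")
    for name, value in changes.items():
        setattr(doc, name, value)
    doc.put()
    return doc
```

For example, update_document(obj, title="Brand new title", access="public") would apply both changes before saving.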
- delete()
Delete a document from DocumentCloud. You must be authorized to make these changes.
>>> obj = client.documents.get('71072-oir-final-report')
>>> obj.delete()
- process()
This will re-process the document. Useful if there was an intermittent error.
- access
The privacy level of the resource within the DocumentCloud system. It will be either public, private or organization, the last of which means the document is only visible to members of the contributor’s organization. Can be edited and saved with a put command.
- annotations
A client to access and update the annotations on the document. See Annotation for more information.
- asset_url
The base URL to obtain the static assets for this document. See the API documentation for more details.
- canonical_url
The URL where the document is hosted at documentcloud.org.
- contributor
The user who originally uploaded the document.
- contributor_organization
The organizational affiliation of the user who originally uploaded the document.
- created_at
The date and time that the document was created, in Python’s datetime format.
- data
A dictionary containing supplementary data linked to the document. This can be any old thing. It’s useful if you’d like to store additional metadata. Can be edited and saved with a put command.
>>> obj = client.documents.get('83251-fbi-file-on-christopher-biggie-smalls-wallace')
>>> obj.data
{'category': 'hip-hop', 'byline': 'Ben Welsh', 'pub_date': datetime.date(2011, 3, 1)}
Keys must be strings and only contain alphanumeric characters.
- description
A summary of the document. Can be edited and saved with a put command.
- edit_access
A boolean indicating whether or not you have the ability to save this document.
- file_hash
A hash representation of the raw PDF data as a hexadecimal string.
>>> obj = client.documents.get('1021571-lafd-2013-hiring-statistics')
>>> obj.file_hash
'872b9b858f5f3e6bb6086fec7f05dd464b60eb26'
You could recreate this hexadecimal hash yourself using the SHA-1 algorithm.
>>> import hashlib
>>> hashlib.sha1(obj.pdf).hexdigest()
'872b9b858f5f3e6bb6086fec7f05dd464b60eb26'
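The same check can be run against a copy of the PDF already on disk, which avoids downloading obj.pdf again. The sketch below reads the file in chunks so large PDFs do not need to fit in memory; sha1_of_file and matches_file_hash are hypothetical helper names.

```python
import hashlib


def sha1_of_file(path, chunk_size=65536):
    """Return the SHA-1 hex digest of a local file, read in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_file_hash(path, file_hash):
    """Compare a local file against a document's file_hash value."""
    return sha1_of_file(path) == file_hash.lower()
```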
- full_text
Returns the full text of the document, as extracted from the original PDF by DocumentCloud. Results may vary, but this will give you what they got.
>>> obj = client.documents.get('71072-oir-final-report')
>>> obj.full_text
"Review of the Los Angeles County Sheriff's\nDepartment's Investigation into the\nHomicide of Ruben Salazar\nA Special Report by the\nLos Angeles County Office of Independent Review\n ...
- full_text_url
Returns the URL that contains the full text of the document, as extracted from the original PDF by DocumentCloud.
- get_errors()
Returns a list containing entries for each error on the document.
>>> new = client.documents.upload("https://www.launchcamden.com/wp-content/uploads/2023/08/7.13.23_01002.pdf")
>>> client.documents.get(new.id).get_errors()
[{'id': 96136, 'created_at': datetime.datetime(2023, 8, 30, 16, 28, 8, 594859), 'message': '404 Client Error: Not Found for url: https://www.launchcamden.com/wp-content/uploads/2023/08/7.13.23_01002.pdf'}]
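The entries returned by get_errors() are plain dictionaries, so they are easy to turn into log lines. This sketch assumes each entry carries the id, created_at and message keys shown in the example output above; summarize_errors is a hypothetical helper.

```python
def summarize_errors(errors):
    """Return one readable line per error entry, oldest first."""
    ordered = sorted(errors, key=lambda entry: entry["created_at"])
    return [
        f"{entry['created_at']:%Y-%m-%d %H:%M} [{entry['id']}] {entry['message']}"
        for entry in ordered
    ]
```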
- get_page_text(page)
Submit a page number and receive the raw text extracted from it by DocumentCloud.
>>> obj = client.documents.get('1088501-adventuretime-alta')
>>> txt = obj.get_page_text(1)
>>> # Let's print just the first line
>>> print(txt.split("\n")[0])
STATE OF CALIFORNIA- HEALTH AND HUMAN SERVICES AGENCY
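To walk through a whole document page by page, get_page_text can be combined with the pages attribute. The generator below is a sketch that assumes pages are numbered starting at 1, as the example above suggests; iter_page_texts is a hypothetical helper name.

```python
def iter_page_texts(doc):
    """Yield (page_number, text) pairs for every page of a document.

    Assumes page numbers start at 1 and run through doc.pages.
    """
    for page in range(1, doc.pages + 1):
        yield page, doc.get_page_text(page)
```

Each call to get_page_text is likely a separate request, so for a long document the full_text attribute may be the cheaper option when page boundaries do not matter.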
- get_page_position_json(page)
Submit a page number and receive the page text position information in JSON format.
>>> obj = client.documents.get('1088501-adventuretime-alta')
>>> json = obj.get_page_position_json(1)
- id
The unique identifier of the document in DocumentCloud’s system. This is a number, for example 83251.
- language
The three character code for the language this document is in.
- large_image
Returns the binary data for the “large” sized image of the document’s first page. If you would like the data for some other page, pass the page number into get_large_image(page).
- large_image_url
Returns a URL containing the “large” sized image of the document’s first page. If you would like the URL for some other page, pass the page number into get_large_image_url(page).
- large_image_url_list
Returns a list of URLs for the “large” sized image of every page in the document.
- mentions
When the document has been retrieved via a search, this returns a list of places the search keywords appear in the text. You must pass mentions=True into the search. The data is modeled by its own Python class, documentcloud.documents.Mention.
>>> obj_list = client.documents.search('Christopher Wallace', mentions=True)
>>> obj = obj_list[0]
>>> obj.mentions
[<Mention: Page 2>, <Mention: Page 3> ....
- normal_image
Returns the binary data for the “normal” sized image of the document’s first page. If you would like the data for some other page, pass the page number into get_normal_image(page).
- normal_image_url
Returns a URL containing the “normal” sized image of the document’s first page. If you would like the URL for some other page, pass the page number into get_normal_image_url(page).
- normal_image_url_list
Returns a list of URLs for the “normal” sized image of every page in the document.
- organization
The documentcloud.organizations.Organization which owns this document. This will require an additional API call unless you specify “organization” in the expand parameter when fetching this document.
- organization_id
The ID for the organization which owns this document.
- page_spec
The page spec is a compressed string that lists dimensions in pixels for every page in a document. Refer to ListCrunch for the compression format. For example, 612.0x792.0:0-447
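As an illustration, the example string above can be expanded into one (width, height) pair per page. This is a sketch, not the ListCrunch library itself: it assumes each group has the form value:indices, that groups are separated by semicolons, and that indices are comma-separated single numbers or inclusive start-end ranges. Only the single-group form appears in the text above; the multi-group handling and the name expand_page_spec are assumptions.

```python
def expand_page_spec(page_spec):
    """Expand a page spec string into a list of (width, height) per page."""
    sizes = {}
    for group in page_spec.split(";"):
        dimensions, _, index_part = group.partition(":")
        width, _, height = dimensions.partition("x")
        size = (float(width), float(height))
        for token in index_part.split(","):
            first, _, last = token.partition("-")
            start = int(first)
            # A bare number is a single page; "a-b" is an inclusive range
            end = int(last) if last else start
            for index in range(start, end + 1):
                sizes[index] = size
    return [sizes[index] for index in sorted(sizes)]
```

Applied to 612.0x792.0:0-447, this yields 448 entries, each (612.0, 792.0).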
- pages
The number of pages in the document.
- pdf
Returns the binary data for the document’s original PDF file.
- pdf_url
Returns a URL containing the binary data for the document’s original PDF file.
- projects
Returns a list of IDs for the projects this document is in.
- published_url
Returns a URL outside of documentcloud.org where this document has been published.
- related_article
Returns a URL for a news story related to this document.
- sections
A client to access and update the sections on the document. See documentcloud.sections.Section for more information.
- slug
Returns the document’s slug. A slug is a URL-friendly version of the title.
- small_image
Returns the binary data for the “small” sized image of the document’s first page. If you would like the data for some other page, pass the page number into get_small_image(page).
- small_image_url
Returns a URL containing the “small” sized image of the document’s first page. If you would like the URL for some other page, pass the page number into get_small_image_url(page).
- small_image_url_list
Returns a list of URLs for the “small” sized image of every page in the document.
- source
The original source of the document. Can be edited and saved with a put command.
- status
This is the status of the document. Possible statuses include:
success: The document has been successfully processed
readable: The document is currently processing, but is readable during the operation
pending: The document is processing and not currently readable
error: There was an error during processing
nofile: The document was created, but no file was uploaded yet
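Since a freshly uploaded document passes through pending or readable before settling, a script may want to wait before reading text or images. The loop below is a minimal polling sketch, assuming that re-fetching the document with client.documents.get returns its current status; wait_until_processed, interval, max_checks and the injectable sleep are hypothetical names chosen for illustration.

```python
import time

# Statuses that mean processing is still under way, per the list above
IN_PROGRESS = {"pending", "readable"}


def wait_until_processed(client, doc_id, interval=5.0, max_checks=60,
                         sleep=time.sleep):
    """Re-fetch a document until it leaves the in-progress statuses."""
    for attempt in range(max_checks):
        doc = client.documents.get(doc_id)
        if doc.status not in IN_PROGRESS:
            # success, error or nofile: nothing more to wait for
            return doc
        if attempt < max_checks - 1:
            sleep(interval)
    raise TimeoutError(f"Document {doc_id} still processing after {max_checks} checks")
```

The caller should still inspect the returned status, since error and nofile also end the wait.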
- thumbnail_image
Returns the binary data for the “thumbnail” sized image of the document’s first page. If you would like the data for some other page, pass the page number into get_thumbnail_image(page).
- thumbnail_image_url
Returns a URL containing the “thumbnail” sized image of the document’s first page. If you would like the URL for some other page, pass the page number into get_thumbnail_image_url(page).
- thumbnail_image_url_list
Returns a list of URLs for the “thumbnail” sized image of every page in the document.
- title
The name of the document. Can be edited and saved with a put command.
- updated_at
The date and time that the document was last updated, in Python’s datetime format.
- user
The documentcloud.users.User which owns this document. This will require an additional API call unless you specify “user” in the expand parameter when fetching this document.
- user_id
The ID for the user who owns this document.