How to get the page number of a bookmark - 编程之家 - python pdf


How to solve: getting the page number of a bookmark

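As @theta pointed out, "split pdf based on outline" contains the code needed to extract page numbers. If that looks complicated, I copied the part of that code that maps page IDs to page numbers and turned it into a function. Here is a working example that prints the page number of bookmark o[0]: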

    from PyPDF2 import PdfFileReader


    def _setup_page_id_to_num(pdf, pages=None, _result=None, _num_pages=None):
        """Map the object id of every node in the page tree to a 0-based page index."""
        if _result is None:
            _result = {}
        if pages is None:
            # First call: start from the root of the page tree.
            _num_pages = []
            pages = pdf.trailer["/Root"].getObject()["/Pages"].getObject()
        t = pages["/Type"]
        if t == "/Pages":
            # Intermediate node: record each kid, then descend into it.
            for page in pages["/Kids"]:
                _result[page.idnum] = len(_num_pages)
                _setup_page_id_to_num(pdf, page.getObject(), _result, _num_pages)
        elif t == "/Page":
            # Leaf node: one more page has been seen.
            _num_pages.append(1)
        return _result


    # main
    with open('document.pdf', 'rb') as f:
        p = PdfFileReader(f)
        # map page ids to page numbers
        pg_id_num_map = _setup_page_id_to_num(p)
        o = p.getOutlines()
        # the map is 0-based, so add 1 for a human-readable page number
        pg_num = pg_id_num_map[o[0].page.idnum] + 1
        print(pg_num)
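To see what the mapping function does without needing a PDF at hand, here is a minimal sketch of the same depth-first walk over a toy page tree. The `OBJECTS` table, its ids, and the name `page_id_to_index` are invented for illustration and are not part of PyPDF2. Unlike the function above, this version records only leaf pages, not intermediate `/Pages` nodes.

```python
# Toy stand-in for a PDF object table: object id -> dictionary.
# The ids and layout are invented for illustration only.
OBJECTS = {
    1: {"/Type": "/Pages", "/Kids": [2, 9]},
    2: {"/Type": "/Pages", "/Kids": [3, 6]},  # nested intermediate node
    3: {"/Type": "/Page"},
    6: {"/Type": "/Page"},
    9: {"/Type": "/Page"},
}


def page_id_to_index(objects, root_id):
    """Walk the page tree depth-first and map each leaf page's object id
    to its 0-based position in reading order."""
    mapping = {}

    def walk(obj_id):
        node = objects[obj_id]
        if node["/Type"] == "/Pages":
            # Intermediate node: visit the kids in order.
            for kid_id in node["/Kids"]:
                walk(kid_id)
        elif node["/Type"] == "/Page":
            # Leaf node: its index is the number of pages seen so far.
            mapping[obj_id] = len(mapping)

    walk(root_id)
    return mapping


id_map = page_id_to_index(OBJECTS, 1)
print(id_map)         # {3: 0, 6: 1, 9: 2}
print(id_map[9] + 1)  # 3, the 1-based page number of object 9
```

The object ids (3, 6, 9) say nothing about page order on their own; only the position in the tree walk does.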

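@theta This may be too late for you, but it might help others :) By the way, this is my first post on Stack Overflow, so please excuse me if I haven't followed the usual format.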

If you want the exact position of each bookmark on its page, this will make your job easier:

    from PyPDF2 import PdfFileReader
    import PyPDF2 as pyPdf


    def _setup_page_id_to_num(pdf, pages=None, _result=None, _num_pages=None):
        """Map the object id of every node in the page tree to a 0-based page index."""
        if _result is None:
            _result = {}
        if pages is None:
            _num_pages = []
            pages = pdf.trailer["/Root"].getObject()["/Pages"].getObject()
        t = pages["/Type"]
        if t == "/Pages":
            for page in pages["/Kids"]:
                _result[page.idnum] = len(_num_pages)
                _setup_page_id_to_num(pdf, page.getObject(), _result, _num_pages)
        elif t == "/Page":
            _num_pages.append(1)
        return _result


    def outlines_pg_zoom_info(outlines, pg_id_num_map, result=None):
        """Collect title, position and 1-based page number for every bookmark."""
        if result is None:
            result = dict()
        if isinstance(outlines, list):
            # Nested lists hold child bookmarks; recurse into them.
            for outline in outlines:
                result = outlines_pg_zoom_info(outline, pg_id_num_map, result)
        elif isinstance(outlines, pyPdf.pdf.Destination):
            title = outlines['/Title']
            # Keyed by the first word of the title. Assumes destinations that
            # carry /Top and /Left; other fit types may lack these keys.
            result[title.split()[0]] = dict(
                title=outlines['/Title'],
                top=outlines['/Top'],
                left=outlines['/Left'],
                page=(pg_id_num_map[outlines.page.idnum] + 1),
            )
        return result


    # main
    pdf_name = 'document.pdf'
    with open(pdf_name, 'rb') as f:
        pdf = PdfFileReader(f)
        # map page ids to page numbers
        pg_id_num_map = _setup_page_id_to_num(pdf)
        outlines = pdf.getOutlines()
        bookmarks_info = outlines_pg_zoom_info(outlines, pg_id_num_map)
        print(bookmarks_info)
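The recursion above relies on the shape of the outline: a nested list in which a sub-list holds the children of the bookmark just before it. As a rough sketch of that traversal on its own, the following flattens such a structure and keeps the nesting depth. Plain dicts stand in for `Destination` objects, and the titles and the name `flatten_outline` are made up for illustration.

```python
def flatten_outline(outline, depth=0):
    """Yield (depth, item) pairs from a nested outline list, where a
    sub-list holds the children of the item that precedes it."""
    for entry in outline:
        if isinstance(entry, list):
            # Children of the previous bookmark: one level deeper.
            yield from flatten_outline(entry, depth + 1)
        else:
            yield depth, entry


# Plain dicts stand in for Destination objects; the titles are made up.
toy_outline = [
    {"/Title": "1 Introduction"},
    [{"/Title": "1.1 Scope"}, {"/Title": "1.2 Terms"}],
    {"/Title": "2 Methods"},
]

flat = list(flatten_outline(toy_outline))
for depth, item in flat:
    # Indent each title by its nesting depth.
    print("  " * depth + item["/Title"])
```

Keeping the items in a list, instead of a dict keyed by the first word of the title, also avoids one bookmark silently overwriting another when two titles start with the same word.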

Question

    from typing import List

    from PyPDF2 import PdfFileReader
    from PyPDF2.generic import Destination


    def get_outlines(pdf_filepath: str) -> List[Destination]:
        """Get the bookmarks of a PDF file."""
        with open(pdf_filepath, "rb") as fp:
            pdf_file_reader = PdfFileReader(fp)
            outlines = pdf_file_reader.getOutlines()
        return outlines


    print(get_outlines("PDF-export-example.pdf"))

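pyPdf.pdf.Destination has many attributes, but I cannot find one that gives the page number the bookmark refers to. How can I get the page number of a bookmark?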


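For example, outlines[1].page.idnum returns a number that is roughly three times the page number referenced in the PDF document. I assume it refers to an object smaller than a page, because the array of numbers that .page.idnum returns when run over the whole document outline is not even linearly related to the "real" target page numbers in the PDF document; it is only roughly a multiple of 3.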
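One plausible explanation is that `idnum` is the number of the PDF indirect object that holds the page, not a page index, and that the file also stores other objects (content streams, fonts, images) between the pages. The sketch below is only a toy model under that assumption: `allocate_object_ids` and the per-page object counts are hypothetical and do not describe how any particular PDF producer works.

```python
def allocate_object_ids(extra_objects_per_page):
    """Assign sequential object ids the way a simple writer might:
    for each page, first its auxiliary objects, then the page itself.
    Returns the page object ids in page order."""
    next_id = 1
    page_ids = []
    for extras in extra_objects_per_page:
        next_id += extras         # ids consumed by streams, fonts, images
        page_ids.append(next_id)  # id given to the /Page object itself
        next_id += 1
    return page_ids


# Hypothetical document: most pages need 2 auxiliary objects, page 3 needs 4.
page_ids = allocate_object_ids([2, 2, 4, 2, 2])
for number, obj_id in enumerate(page_ids, start=1):
    print(number, obj_id, round(obj_id / number, 2))
```

In this toy layout the ids are 3, 6, 11, 14, 17: close to three times the page number, but not a linear function of it. This is why the id has to be looked up in a map built from the page tree and cannot be converted arithmetically.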


Created: 2023-05-08 17:22:43