arxiv_on_deck_2 package#
Submodules#
arxiv_on_deck_2.arxiv module#
- class ArXivPaper(identifier: str = '', highlight_authors: Sequence[str] | None = None, appearedon: str | None = None)[source]#
Bases:
object
Class that handles the interface to the abstract page of a paper on the Arxiv website
Represents a paper from ArXiv.
The main URLs are:
source = “https://arxiv.org/e-print/{identifier}”
abstract = “https://arxiv.org/abs/{identifier}”
- Parameters:
identifier – the identifier of the paper
highlight_authors – a list of authors to highlight
appearedon – the date the paper appeared on
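As a rough illustration of how the two URL templates above expand, here is a minimal sketch. The helper name `paper_urls` and the example identifier are hypothetical and not part of the package.

```python
# URL templates quoted in the class description above.
SOURCE_URL = "https://arxiv.org/e-print/{identifier}"
ABSTRACT_URL = "https://arxiv.org/abs/{identifier}"


def paper_urls(identifier: str) -> dict:
    """Return the source and abstract URLs for an identifier (hypothetical helper)."""
    return {
        "source": SOURCE_URL.format(identifier=identifier),
        "abstract": ABSTRACT_URL.format(identifier=identifier),
    }


urls = paper_urls("2301.00001")  # placeholder identifier, for illustration only
print(urls["abstract"])  # https://arxiv.org/abs/2301.00001
```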
- property authors: str#
Return the authors of the paper
- Returns:
the authors of the paper
- classmethod from_identifier(identifier: str)[source]#
Instantiate from a paper identifier
- Parameters:
identifier (str) – the identifier of the paper
- property short_authors: str#
Return a short list of authors (e.g., <first author, et al.>)
- class ArxivListHTMLParser(*args, **kwargs)[source]#
Bases:
HTMLParser
Generates a list of paper items by parsing the Arxiv listing of new papers
- filter_papers(papers: Sequence[ArXivPaper], fname_list: Sequence[str]) Sequence[ArXivPaper] [source]#
Extract papers when an author match is found
- Parameters:
papers – paper list
fname_list – authors to search
- Returns:
papers with matching author
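The filtering idea can be sketched as follows. This is not the package's implementation: `Paper` is a hypothetical stand-in for `ArXivPaper`, `filter_by_family_name` is an illustrative name, and the case-insensitive substring test is an assumption about how a match might be decided.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Paper:
    """Hypothetical stand-in for ArXivPaper with only the fields the filter needs."""
    identifier: str
    authors: str  # author names as a single string


def filter_by_family_name(papers: Sequence[Paper], fname_list: Sequence[str]) -> list:
    """Keep the papers whose author string contains at least one family name."""
    wanted = [name.lower() for name in fname_list]
    return [
        paper for paper in papers
        if any(name in paper.authors.lower() for name in wanted)
    ]
```

A plain substring test is crude (a short name such as "Li" also matches "Oliver"), so a real matcher would likely compare whole name tokens instead.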
- get_new_papers(skip_replacements: bool = True) Sequence[ArXivPaper] [source]#
Retrieve the list of new papers from the website.
- Parameters:
skip_replacements (bool, optional) – set to skip parsing the replacements, defaults to True
- Returns:
list of ArXivPaper objects
- Return type:
Sequence[ArXivPaper]
arxiv_on_deck_2.arxiv2 module#
How to interact with ArXiv to retrieve papers’ information and sources
- class ArxivPaper(**paper_data)[source]#
Bases:
dict
Paper representation using ArXiv information
A dictionary like structure that contains:
identifier: the arxiv identification number
title: the title of the paper
authors: the authors of the paper
comments: the comments of the paper
abstract: the abstract of the paper
- classmethod from_bs4_tags(dt: Tag, dd: Tag)[source]#
Extract paper information from its pair of tags
- Parameters:
dt (Tag) – the tag of the title
dd (Tag) – the tag of the description
- Returns:
the paper object
- generate_markdown_text() str [source]#
Generate the markdown text of this paper
- Returns:
the markdown text summary of the paper
- property short_authors: Sequence[str]#
Make a short author list if there are more authors than nmin.
This means: <first author> et al., incl. <list of highlighted authors>
- Parameters:
authors – the list of authors
nmin – the minimum number of authors to switch to short representation
- Returns:
the list of authors
- get_markdown_badge(identifier: str) str [source]#
Generate the markdown badge for a paper
- Return type:
str
- Parameters:
identifier (str) – arxiv identifier of the paper
- Returns:
markdown badge
- get_new_papers() Sequence[ArxivPaper] [source]#
Retrieve the list of new papers from the website.
- Returns:
list of ArxivPaper objects
- get_paper_from_identifier(paper_identifier: str) ArxivPaper [source]#
Retrieve a paper from Arxiv using its identifier
- Return type:
ArxivPaper
- Parameters:
paper_identifier (str) – arxiv identifier of the paper
- Returns:
Paper object
- retrieve_document_source(identifier: str, directory: str) str [source]#
Retrieve document source tarball and extract it.
- Return type:
str
- Parameters:
identifier (str) – Paper identification number from Arxiv
directory (str) – where to store the extracted files
- Returns:
directory in which the data was extracted
arxiv_on_deck_2.arxiv_vanity module#
Interface to Arxiv Vanity
Access a paper through the following URL: https://www.arxiv-vanity.com/papers/{arxiv_id}
If the paper is not already stored, it will be processed.
- author_match(author: str, hl_list: Sequence[str], verbose=False) Sequence[str] [source]#
Match author names against a reference list of family names
- Parameters:
author (str) – the author string to check
hl_list – the list of reference authors to match
verbose (bool) – prints matching results if set
- Returns:
the matching sequences, or an empty sequence if there is no match
- collect_summary_information(identifiers: Sequence[str], content_requirements: callable | None = None, wait: int = 10) Sequence[dict] [source]#
Extract necessary information from the vanity webpage
- Parameters:
identifiers – list of arxiv identifiers to attempt to retrieve
content_requirements – filter function that returns False if the paper does not meet the requirements
wait (int) – how many seconds to wait between retries
- Returns:
a sequence of dictionaries, each with the following keys: (title, authors, abstract, paper_id, url, figures, and the soup object)
- generate_markdown_text(content: dict) str [source]#
Generate the summary markdown content
- Return type:
str
- Parameters:
content (dict) – the content of the paper
- Returns:
the markdown representation of the paper
- get_arxiv_vanity_badge(identifier: str) str [source]#
Generate the markdown badge for a paper
- Return type:
str
- Parameters:
identifier (str) – arxiv identifier of the paper
- Returns:
markdown badge
- highlight_author(author_list: Sequence[str], author: str) Sequence[str] [source]#
Highlight a particular author in the list of authors
- Parameters:
author_list – the list of authors
author (str) – the author to highlight
- Returns:
the list of authors with the highlighted author
- highlight_authors_in_list(author_list: Sequence[str], hl_list: Sequence[str], verbose: bool = False) Sequence[str] [source]#
Highlight all authors of the paper that match hl_list entries
- Parameters:
author_list – the list of authors
hl_list – the list of authors to highlight
verbose (bool) – prints matching results if set
- Returns:
the list of authors with the highlighted authors
- make_short_author_list(authors: Sequence[str], nmin: int = 4) Sequence[str] [source]#
Make a short author list if there are more authors than nmin.
This means: <first author> et al., incl. <list of highlighted authors>
- Parameters:
authors – the list of authors
nmin (int) – the minimum number of authors to switch to short representation
- Returns:
the list of authors
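The shortening rule described for make_short_author_list might look like the sketch below. The function name `short_author_list` and the explicit `highlighted` argument are assumptions made for illustration; the package may detect highlighted authors differently.

```python
from typing import Sequence


def short_author_list(
    authors: Sequence[str],
    nmin: int = 4,
    highlighted: Sequence[str] = (),
) -> list:
    """Shorten to '<first author>, et al.' when there are more than nmin authors."""
    if len(authors) <= nmin:
        return list(authors)
    short = [authors[0], "et al."]
    # Keep highlighted co-authors visible after the abbreviation.
    extra = [name for name in authors[1:] if name in highlighted]
    if extra:
        short.append("incl. " + ", ".join(extra))
    return short


print(short_author_list(["A", "B", "C", "D", "E"], highlighted=["C"]))
# ['A', 'et al.', 'incl. C']
```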
arxiv_on_deck_2.latex module#
- class LatexDocument(folder: str, validation: callable | None = None, debug: bool = False)[source]#
Bases:
object
Handles the latex document interface.
Allows extracting the title, authors, figures, and abstract
- Parameters:
folder – folder containing the document
main_file – name of the main document
content – the document content from TexSoup
title – the title of the paper
authors – the authors of the paper
comments – the comments of the paper
abstract – the abstract of the paper
- property abstract: str#
The abstract of the paper
- property authors: str#
The authors of the paper
- property figures: Sequence[LatexFigure]#
All figures from the paper
- generate_markdown_text(with_figures: bool = True) str [source]#
Generate the markdown summary
- Return type:
str
- Parameters:
with_figures (bool) – if True, the figures are included in the summary
- Returns:
markdown text
- get_all_figures() Sequence[LatexFigure] [source]#
Retrieve all figures (num, images, caption, label) from a document
- Parameters:
content – the document content
- Returns:
sequence of LatexFigure objects
- highlight_authors_in_list(hl_list: Sequence[str], verbose: bool = False)[source]#
Highlight all authors of the paper that match hl_list entries
- Parameters:
hl_list – list of authors to highlight
verbose (bool) – display matching information if set
- select_arxivertag_figures()[source]#
Finds the figures referenced by the arxiver tag
- Returns:
list of selected figures
- select_most_cited_figures(N: int = 4)[source]#
Finds the number of references to each figure and selects the N most cited ones
- Parameters:
N (int) – number of figures to select
- Returns:
list of selected figures
- property short_authors: Sequence[str]#
Make a short author list if there are more authors than nmin.
This means: <first author> et al., incl. <list of highlighted authors>
- Parameters:
authors – the list of authors
nmin – the minimum number of authors to switch to short representation
- Returns:
the list of authors
- property title: str#
The title of the paper
- class LatexFigure(**data)[source]#
Bases:
dict
Representation of a figure from a LatexDocument
A dictionary-like structure that contains:
num: figure number
caption: figure caption
label: figure label
images: list of images
- clear_latex_comments(data: str) str [source]#
Remove all comments from the text
- Return type:
str
- Parameters:
data (str) – text to clean
- Returns:
cleaned text
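One common way to strip LaTeX comments is a regular expression that skips escaped percent signs. The sketch below illustrates that approach under stated assumptions; the name `strip_latex_comments` is hypothetical and the package's actual rules may differ.

```python
import re

# A '%' starts a comment unless it is escaped as '\%'.
_COMMENT = re.compile(r"(?<!\\)%.*")


def strip_latex_comments(text: str) -> str:
    """Drop everything from an unescaped '%' to the end of each line."""
    # '.' does not match newlines, so each comment stops at its line end.
    return _COMMENT.sub("", text)


print(strip_latex_comments("value % a note\n50\\% of the sample"))
```

This simple pattern does not handle an escaped backslash directly before a comment (`\\%`) or percent signs inside verbatim environments.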
- convert_eps_to_image(fname: str) str [source]#
Convert an image from EPS to PNG.
The new image is stored alongside the original one
- Return type:
str
- Parameters:
fname (str) – file to potentially convert
- convert_pdf_to_image(fname: str) str [source]#
Convert an image from PDF to PNG.
The new image is stored alongside the original one
- Return type:
str
- Parameters:
fname (str) – file to potentially convert
- figure_fallback(source: str) Sequence[str] [source]#
Fallback procedure that uses pure regex to parse the paper when TexSoup fails
- Parameters:
source (str) – latex source to parse
- Returns:
list of figure or figure* environments found in the source
- find_graphics(where: str, image: str, folder: str = '', attempt_recover_extension: bool = True) str [source]#
Find graphics files for the figure if graphicspath provided
- find_main_doc(folder: str) str | Sequence[str] [source]#
Attempt to find which TeX file is the main document.
- Parameters:
folder (str) – folder containing the document
- Returns:
filename of the main document
- fix_def_command(text: str) str [source]#
Work around a small bug in TexSoup with definitions of the form \def\name{}
This function parses the text to add braces if needed: \def\name{} –> \def{\name}{}
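A minimal sketch of such a rewrite, assuming it turns `\def\name{}` into `\def{\name}{}`. The function name `add_braces_to_def` and the regular expression are illustrative, not the package's code.

```python
import re

# Match "\def" followed directly by a control sequence such as "\name".
_DEF_WITHOUT_BRACES = re.compile(r"\\def\s*(\\[A-Za-z@]+)")


def add_braces_to_def(text: str) -> str:
    """Wrap the macro name that follows a def command in braces when missing."""
    # A function replacement avoids backslash escaping in the substitution.
    return _DEF_WITHOUT_BRACES.sub(lambda m: r"\def{" + m.group(1) + "}", text)


print(add_braces_to_def(r"\def\name{}"))  # \def{\name}{}
```

Definitions that already use braces are left untouched because the pattern requires a backslash right after `\def`.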
- force_macros_mathmode(text: str, macros: Sequence[str]) str [source]#
Make sure that detected macros are in math mode. They sometimes are not
- force_mathmode(node)[source]#
Force all TeX commands in the node to be in math mode.
It first checks whether a command is already in math mode to avoid issues.
- get_arxivertag(source: str) list [source]#
Retrieve the arxiver tag if any
To specify which figures should appear alongside their paper, authors can leave a comment in any .tex file, as in this example:
%@arxiver{fig1, fig2, fig3}
see: https://arxiver.moonhats.com/
- Return type:
list
- Parameters:
source (str) – latex source
- Returns:
list of tagged figures
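Extracting the tagged figure names can be sketched with a regular expression, assuming the tag is spelled `%@arxiver{...}` with comma-separated names. The name `find_arxiver_tag` is hypothetical.

```python
import re

# A comment of the form "%@arxiver{fig1, fig2}" anywhere in the source.
_TAG = re.compile(r"%\s*@arxiver\s*\{([^}]*)\}")


def find_arxiver_tag(source: str) -> list:
    """Return the figure names listed in the first arxiver tag, if any."""
    match = _TAG.search(source)
    if match is None:
        return []
    return [name.strip() for name in match.group(1).split(",") if name.strip()]


print(find_arxiver_tag("intro\n%@arxiver{fig1, fig2, fig3}\n"))
# ['fig1', 'fig2', 'fig3']
```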
- get_content(source: str, flexible: bool = True, verbose: bool = False) TexNode [source]#
Parse the source with TexSoup and try to recover if something goes wrong.
As we do not need the exact text throughout the paper, we can try to isolate potential error sections. The following attempts to remove the line that triggers an error.
- get_content_per_section(source: str, flexible: bool = True, verbose: bool = True) Sequence [source]#
Find problematic portions of the document and attempt to skip them
- get_macros_names(macros: Sequence[str]) Sequence[str] [source]#
Return the list of macro names found in newcommand definitions
- inject_other_sources(maintex: str, texfiles: Sequence[str], verbose: bool = False)[source]#
Replace input and include commands with the content of the sub-files
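The substitution can be sketched as below. This stand-alone version takes a hypothetical mapping from file name to file content, whereas the documented function receives a sequence of TeX files; the name `inline_sources` is also an assumption.

```python
import re
from typing import Mapping

# "\input{name}" or "\include{name}"; "\includegraphics{...}" does not match.
_INPUT = re.compile(r"\\(?:input|include)\{([^}]+)\}")


def inline_sources(main_text: str, sources: Mapping[str, str]) -> str:
    """Replace input/include commands by the content of the referenced files."""

    def _replace(match):
        name = match.group(1).strip()
        # LaTeX allows omitting the ".tex" extension.
        for key in (name, name + ".tex"):
            if key in sources:
                return sources[key]
        return match.group(0)  # unknown file: keep the command as written

    return _INPUT.sub(_replace, main_text)
```

The sketch performs a single pass, so nested inclusions inside the injected files would need a repeated call.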
- replace(self, child, *nodes)[source]#
Replace provided node with node(s).
- Parameters:
child (TexNode) – Child node to replace
nodes (TexNode) – List of nodes to substitute in
>>> from TexSoup import TexSoup
>>> soup = TexSoup(r'''
... \begin{itemize}
...     \item Hello
...     \item Bye
... \end{itemize}''')
>>> items = list(soup.find_all('item'))
>>> bye = items[1]
>>> soup.itemize.replace(soup.item, bye)
>>> soup.itemize
\begin{itemize}
    \item Bye
    \item Bye
\end{itemize}
- select_most_cited_figures(figures: Sequence[LatexFigure], content: dict, N: int = 3) Sequence[LatexFigure] [source]#
Finds the number of references to each figure and selects the N most cited ones
- Parameters:
figures – list of all figures
content (dict) – paper content from TexSoup
N (int) – number of figures to select
- Returns:
list of selected figures
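Ranking figures by citation count can be sketched as follows, assuming references are written as `\ref{label}`. This version works on raw source text and label strings rather than on LatexFigure objects and TexSoup content, and the name `most_cited_labels` is hypothetical.

```python
import re
from collections import Counter
from typing import Sequence


def most_cited_labels(source: str, labels: Sequence[str], n: int = 3) -> list:
    """Rank figure labels by how often the source references them."""
    known = set(labels)
    references = re.findall(r"\\ref\{([^}]*)\}", source)
    counts = Counter(ref for ref in references if ref in known)
    return [label for label, _ in counts.most_common(n)]


text = r"See \ref{fig:a} and \ref{fig:b}. Again \ref{fig:b}. Table \ref{tab:x}."
print(most_cited_labels(text, ["fig:a", "fig:b", "fig:c"], n=2))
# ['fig:b', 'fig:a']
```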
arxiv_on_deck_2.latex_bib module#
- class LatexBib(bibdata: BibliographyData)[source]#
Bases:
object
A small interface to pybtex to handle bibliography entries
- classmethod from_doc(doc: LatexDocument)[source]#
Create from a LatexDocument object
First checks whether a .bbl file accompanies the document; if not, attempts to read the .bib file instead.
- Parameters:
doc (LatexDocument) – the document to link with
- Returns:
LatexBib object
TODO: extract bibitems entries from main doc if any
- get_citation_md(key: str | Entry, kind: str = 'cite', max_authors: int = 3) str [source]#
Return formatted markdown string
examples:
* kind="citet" -> [author1, et al. (2023)](url)
* kind="cite" -> [author1 and author2 2023](url)
- Return type:
str
- Parameters:
key – key to extract from
kind (str) – the kind of latex citation (expecting cite, citealt, citet, citep)
max_authors (int) – the number of authors to allow before the “et al.” abbreviation
- Returns:
the formatted citation text
- get_citation_text(key: str | Entry, kind: str = 'cite', max_authors: int = 3) str [source]#
Return formatted text
examples:
* kind="citet" -> author1, et al. (2023)
* kind="cite" -> author1 and author2 2023
- Return type:
str
- Parameters:
key – key to extract from
kind (str) – the kind of latex citation (expecting cite, citealt, citet, citep)
max_authors (int) – the number of authors to allow before the “et al.” abbreviation
- Returns:
the formatted citation text
- get_short_authors(key: str | Entry, max_authors: int = 3) str [source]#
Returns the astro style author list (e.g., bob et al.)
- Return type:
str
- Parameters:
key – key to extract from
max_authors (int) – the number of authors to allow before the “et al.” abbreviation
- Returns:
short string of authors (e.g., lada and lada, a, b and c, bob et al)
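The formats listed in the return description can be reproduced with a sketch like the one below. It operates on a plain list of family names rather than on a pybtex entry, and the name `short_authors` is hypothetical.

```python
from typing import Sequence


def short_authors(family_names: Sequence[str], max_authors: int = 3) -> str:
    """Join family names, abbreviating to 'first et al.' beyond max_authors."""
    names = list(family_names)
    if not names:
        return ""
    if len(names) > max_authors:
        return f"{names[0]} et al."
    if len(names) == 1:
        return names[0]
    return ", ".join(names[:-1]) + " and " + names[-1]


print(short_authors(["lada", "lada"]))  # lada and lada
print(short_authors(["a", "b", "c"]))  # a, b and c
print(short_authors(["bob", "x", "y", "z"]))  # bob et al.
```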
- clean_special_characters(source: str) str [source]#
Replace LaTeX macros for special characters (accents, etc.) with their Unicode equivalents
- Return type:
str
- Parameters:
source (str) – bibitem raw string definition
- Returns:
transformed bibitem string definition
- merge_BibliographyData(dbs: Sequence[BibliographyData]) BibliographyData [source]#
Merge BibliographyData objects
- Return type:
BibliographyData
- Parameters:
dbs – Sequence of bibliographic data objects
- Returns:
single bibliographic data with all entries from dbs
- parse_bbl(fname: str) BibliographyData [source]#
Parse bibliographic information from bbl file (compiled bibliography)
- Return type:
BibliographyData
- Parameters:
fname (str) – filename to read the data from
- Returns:
biblio data object
- replace_citations(full_md: str, bibdata: LatexBib, kind='all', raise_exceptions: bool = False)[source]#
Parse and replace citex calls remaining in the Markdown text
- Parameters:
full_md (str) – Markdown document
bibdata (LatexBib) – the bibliographic data
kind (str) – which of citex macros (all, citet, citep, citealt)
raise_exceptions (bool) – if set, raise an exception when a citation causes an issue
- Returns:
updated content
arxiv_on_deck_2.mpia module#
MPIA related functions.
This module handles the list of MPIA authors to monitor. Eventually, scientists should provide their publication names.
- affiliation_verifications(content: str, word_list: Sequence[str] | None = None, verbose: bool = False) bool [source]#
Check if specific keywords are present to make sure at least one author is from MPIA. The test is case-insensitive, but all words must appear.
- Return type:
bool
- Parameters:
content (str) – text to check
word_list – list of words required for verification
verbose (bool) – print information
- Returns:
True if all words are present
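The check described above (case-insensitive, all words required) can be sketched as follows. The name `contains_all_words` is hypothetical, and the word list is passed explicitly because the package's default list is not documented here.

```python
from typing import Sequence


def contains_all_words(content: str, word_list: Sequence[str]) -> bool:
    """Return True only if every word occurs in the content, ignoring case."""
    lowered = content.lower()
    return all(word.lower() in lowered for word in word_list)


affiliation = "Max-Planck-Institut für Astronomie, Heidelberg"
print(contains_all_words(affiliation, ["heidelberg", "max", "planck"]))  # True
```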
- consider_variations(name: str) str [source]#
Consider a name with the usual character replacements
- Return type:
str
- Parameters:
name (str) – name
- Returns:
name with replacements
- family_name_from_initials(initials: str) str [source]#
Get family name only
- Return type:
str
- Parameters:
initials (str) – name with initials
- Returns:
family name
- filter_non_scientists(name: str) bool [source]#
Loose filter on expected authorships
Removes IT, administration, and technical staff
- Return type:
bool
- Parameters:
name (str) – name
- Returns:
False if name is not a scientist
- get_initials(name: str) str [source]#
Get the short name, e.g., A.-B. FamName
- Return type:
str
- Parameters:
name (str) – full name
- Returns:
initials
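A possible way to build the "A.-B. FamName" form is sketched below, assuming the last whitespace-separated token is the family name and hyphenated given names keep their hyphen. The function name `initials_of` is hypothetical, and the package may treat name particles differently.

```python
def initials_of(name: str) -> str:
    """Abbreviate given names to initials and keep the family name."""
    parts = name.split()
    if not parts:
        return ""
    *given, family = parts
    short = [
        # "Anna-Bella" becomes "A.-B."
        "-".join(piece[0].upper() + "." for piece in part.split("-") if piece)
        for part in given
    ]
    return " ".join(short + [family])


print(initials_of("Anna-Bella FamName"))  # A.-B. FamName
```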
- get_mpia_mitarbeiter_list() Sequence[str] [source]#
Get the main filtered list
- Returns:
list of names (family name, full names, initials)
- get_special_corrections(initials_name: str) str [source]#
Handle non-generic cases of initials
- Return type:
str
- Parameters:
initials_name (str) – name with initials
- Returns:
name with corrected initials