arxiv_on_deck_2 package#

Submodules#

arxiv_on_deck_2.arxiv module#

class ArXivPaper(identifier: str = '', highlight_authors: Sequence[str] | None = None, appearedon: str | None = None)[source]#

Bases: object

Handles the interface to the abstract page of a paper on the arXiv website.

Represents a paper from ArXiv.

The main URLs are:

Parameters:
  • identifier – the identifier of the paper

  • highlight_authors – a list of authors to highlight

  • appearedon – the date the paper appeared on

property authors: str#

Return the authors of the paper

Returns:

the authors of the paper

classmethod from_identifier(identifier: str)[source]#

Instantiate from a paper identifier

Parameters:

identifier (str) – the identifier of the paper

property short_authors: str#

Return a short list of authors (e.g., <first author, et al.>)

class ArxivListHTMLParser(*args, **kwargs)[source]#

Bases: HTMLParser

Generates a list of Paper items by parsing the arXiv "new" listing page

handle_data(data)[source]#

Data encountered

handle_endtag(tag)[source]#

End of tag encountered

handle_starttag(tag, attrs)[source]#

New tag encountered

filter_papers(papers: Sequence[ArXivPaper], fname_list: Sequence[str]) Sequence[ArXivPaper][source]#

Extract papers when an author match is found

Parameters:
  • papers – paper list

  • fname_list – authors to search

Returns:

papers with matching author
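
As a rough illustration of the filtering described above, the sketch below keeps the papers whose author string contains one of the searched names. The `Paper` stand-in class, the function name, and the plain case-insensitive substring test are assumptions made for this example; the package's own matching may differ.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Paper:
    """Hypothetical stand-in for ArXivPaper (illustration only)."""
    identifier: str
    authors: str


def filter_papers_sketch(papers: Sequence[Paper],
                         fname_list: Sequence[str]) -> List[Paper]:
    """Keep the papers where at least one searched name occurs in the authors."""
    names = [name.lower() for name in fname_list]
    return [paper for paper in papers
            if any(name in paper.authors.lower() for name in names)]


papers = [Paper("2301.00001", "A. Lada, B. Smith"),
          Paper("2301.00002", "C. Doe")]
selected = filter_papers_sketch(papers, ["Lada"])
```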

get_new_papers(skip_replacements: bool = True) Sequence[ArXivPaper][source]#

Retrieve the list of new papers from the website.

Parameters:

skip_replacements (bool, optional) – set to skip parsing the replacements, defaults to True

Returns:

list of ArXivPaper objects

Return type:

Sequence[ArXivPaper]

arxiv_on_deck_2.arxiv2 module#

How to deal with ArXiv and getting papers’ information and sources

class ArxivPaper(**paper_data)[source]#

Bases: dict

Paper representation using ArXiv information

A dictionary like structure that contains:

  • identifier: the arxiv identification number

  • title: the title of the paper

  • authors: the authors of the paper

  • comments: the comments of the paper

  • abstract: the abstract of the paper

classmethod from_bs4_tags(dt: Tag, dd: Tag)[source]#

extract paper information from its pair of tags

Parameters:
  • dt (Tag) – the tag of the title

  • dd (Tag) – the tag of the description

Returns:

the paper object

generate_markdown_text() str[source]#

Generate the markdown text of this paper

Returns:

the markdown text summary of the paper

property short_authors: Sequence[str]#

Make a short author list if there are more authors than nmin. This means <first author> et al., including <list of highlighted authors>

Parameters:
  • authors – the list of authors

  • nmin – the minimum number of authors to switch to short representation

Returns:

the list of authors

get_markdown_badge(identifier: str) str[source]#

Generate the markdown badge for a paper

Return type:

str

Parameters:

identifier (str) – arxiv identifier of the paper

Returns:

markdown badge

get_new_papers() Sequence[ArxivPaper][source]#

Retrieve the list of new papers from the website.

Returns:

list of ArXivPaper objects

get_paper_from_identifier(paper_identifier: str) ArxivPaper[source]#

Retrieve a paper from Arxiv using its identifier

Return type:

ArxivPaper

Parameters:

paper_identifier (str) – arxiv identifier of the paper

Returns:

Paper object

retrieve_document_source(identifier: str, directory: str) str[source]#

Retrieve document source tarball and extract it.

Return type:

str

Parameters:
  • identifier (str) – Paper identification number from Arxiv

  • directory (str) – where to store the extracted files

Returns:

directory in which the data was extracted

arxiv_on_deck_2.arxiv_vanity module#

Interface to Arxiv Vanity

https://www.arxiv-vanity.com/

access a paper through the following URL: https://www.arxiv-vanity.com/papers/{arxiv_id}

If the paper is not already stored, it will be processed.

author_match(author: str, hl_list: Sequence[str], verbose=False) Sequence[str][source]#

Matching author names with a family name reference list

Parameters:
  • author (str) – the author string to check

  • hl_list – the list of reference authors to match

  • verbose (bool) – prints matching results if set

Returns:

the matching names, or an empty sequence if there is no match
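
The idea of matching an author against a family-name reference list can be sketched as follows. This is a simplified illustration, not the package's implementation: it assumes the family name is the last whitespace-separated token of each name, and the function name is hypothetical.

```python
from typing import List, Sequence


def author_match_sketch(author: str, hl_list: Sequence[str]) -> List[str]:
    """Return the reference names whose family name matches the author's."""
    # Assumption: the family name is the last token ("A.-B. Lada" -> "Lada").
    tokens = author.replace(".", ". ").split()
    if not tokens:
        return []
    family = tokens[-1].lower()
    return [ref for ref in hl_list
            if ref.split() and ref.split()[-1].lower() == family]
```

For example, `author_match_sketch("A.-B. Lada", ["Lada", "Smith"])` keeps only the `"Lada"` entry.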

collect_summary_information(identifiers: Sequence[str], content_requirements: callable | None = None, wait: int = 10) Sequence[dict][source]#

Extract necessary information from the vanity webpage

Parameters:
  • identifiers – list of arxiv identifiers to attempt to retrieve

  • content_requirements – filter function that returns False if the paper does not meet the requirements

  • wait (int) – how many seconds to wait between retries.

Returns:

a sequence of dictionaries, each with the following keys: title, authors, abstract, paper_id, url, figures, and the soup object

generate_markdown_text(content: dict) str[source]#

Generate the summary markdown content

Return type:

str

Parameters:

content (dict) – the content of the paper

Returns:

the markdown representation of the paper

get_arxiv_vanity_badge(identifier: str) str[source]#

Generate the markdown badge for a paper

Return type:

str

Parameters:

identifier (str) – arxiv identifier of the paper

Returns:

markdown badge

highlight_author(author_list: Sequence[str], author: str) Sequence[str][source]#

Highlight a particular author in the list of authors

Parameters:
  • author_list – the list of authors

  • author (str) – the author to highlight

Returns:

the list of authors with the highlighted author

highlight_authors_in_list(author_list: Sequence[str], hl_list: Sequence[str], verbose: bool = False) Sequence[str][source]#

Highlight all authors of the paper that match hl_list entries

Parameters:
  • author_list – the list of authors

  • hl_list – the list of authors to highlight

  • verbose (bool) – prints matching results if set

Returns:

the list of authors with the highlighted authors

make_short_author_list(authors: Sequence[str], nmin: int = 4) Sequence[str][source]#

Make a short author list if there are more authors than nmin. This means <first author> et al., including <list of highlighted authors>

Parameters:
  • authors – the list of authors

  • nmin (int) – the minimum number of authors to switch to short representation

Returns:

the list of authors
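
A minimal sketch of the shortening rule described above. The separate `highlighted` argument and the exact "incl." formatting are assumptions for illustration; the package may mark highlighted authors differently.

```python
from typing import List, Sequence


def short_author_list_sketch(authors: Sequence[str],
                             highlighted: Sequence[str] = (),
                             nmin: int = 4) -> List[str]:
    """Shorten the list to '<first author> et al.' when it exceeds nmin."""
    if len(authors) <= nmin:
        return list(authors)
    short = [authors[0], "et al."]
    # Keep highlighted co-authors visible (the first author is already listed).
    extra = [name for name in authors[1:] if name in highlighted]
    if extra:
        short.append("-- incl. " + ", ".join(extra))
    return short
```

With five authors and the default `nmin=4`, only the first author and any highlighted co-authors remain; shorter lists are returned unchanged.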

select_most_cited_figures(content: dict, N: int = 3) Sequence[source]#

Finds the number of references to each figure and select the N most cited ones

Parameters:
  • content (dict) – the content of the paper

  • N (int) – the number of figures to select

Returns:

a list of N figures

arxiv_on_deck_2.latex module#

class LatexDocument(folder: str, validation: callable | None = None, debug: bool = False)[source]#

Bases: object

Handles the latex document interface.

Allows extracting the title, authors, figures, and abstract.

Parameters:
  • folder – folder containing the document

  • main_file – name of the main document

  • content – the document content from TexSoup

  • title – the title of the paper

  • authors – the authors of the paper

  • comments – the comments of the paper

  • abstract – the abstract of the paper

property abstract: str#

The abstract of the paper

property authors: str#

The authors of the paper

property figures: Sequence[LatexFigure]#

All figures from the paper

generate_markdown_text(with_figures: bool = True) str[source]#

Generate the markdown summary

Return type:

str

Parameters:

with_figures (bool) – if True, the figures are included in the summary

Returns:

markdown text

get_abstract() str[source]#

Extract abstract from document

get_all_figures() Sequence[LatexFigure][source]#

Retrieve all figures (num, images, caption, label) from a document

Parameters:

content – the document content

Returns:

sequence of LatexFigure objects

get_authors() Sequence[str][source]#

Get list of authors

get_graphicspath() Sequence[str][source]#

Retrieve the graphicspath if declared

get_macros_markdown_text() str[source]#

Construct the Markdown object of the macros

get_texfiles()[source]#

returns all tex files in the folder (and subfolders)

get_title() str[source]#

Extract document’s title

highlight_authors_in_list(hl_list: Sequence[str], verbose: bool = False)[source]#

Highlight all authors of the paper that match hl_list entries

Parameters:
  • hl_list – list of authors to highlight

  • verbose (bool) – display matching information if set

retrieve_latex_macros() Sequence[str][source]#

Get the macros defined in the document

select_arxivertag_figures()[source]#

Finds the figures referenced by the arxiver tag

Returns:

list of selected figures

select_most_cited_figures(N: int = 4)[source]#

Finds the number of references to each figure and select the N most cited ones

Parameters:

N (int) – number of figures to select

Returns:

list of selected figures

property short_authors: Sequence[str]#

Make a short author list if there are more authors than nmin. This means <first author> et al., including <list of highlighted authors>

Parameters:
  • authors – the list of authors

  • nmin – the minimum number of authors to switch to short representation

Returns:

the list of authors

property title: str#

The title of the paper

class LatexFigure(**data)[source]#

Bases: dict

Representation of a figure from a LatexDocument

A dictionary-like structure that contains:

  • num: figure number

  • caption: figure caption

  • label: figure label

  • images: list of images

generate_markdown_text()[source]#

Generate the markdown summary

Returns:

markdown text

exception LatexWarning[source]#

Bases: UserWarning

clear_latex_comments(data: str) str[source]#

Remove all comments from the text

Return type:

str

Parameters:

data (str) – text to clean

Returns:

cleaned text
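
One common way to strip LaTeX comments is a single regular expression, sketched below. It is not necessarily what the package does, and the function name is hypothetical. It leaves escaped percent signs (`\%`) alone but does not handle rarer cases such as `\\%` or verbatim environments.

```python
import re


def clear_latex_comments_sketch(data: str) -> str:
    """Drop everything from an unescaped % to the end of each line."""
    # (?<!\\) keeps escaped percent signs such as 10\% untouched;
    # '.' does not match newlines, so line breaks are preserved.
    return re.sub(r"(?<!\\)%.*", "", data)


cleaned = clear_latex_comments_sketch("x = 10\\% % a comment\n% full line\ny")
```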

convert_eps_to_image(fname: str) str[source]#

Convert image from EPS to png.

The new image is stored with the original one

Return type:

str

Parameters:

fname (str) – file to potentially convert

convert_pdf_to_image(fname: str) str[source]#

Convert image from PDF to png.

The new image is stored with the original one

Return type:

str

Parameters:

fname (str) – file to potentially convert

drop_none_from_list(list_: Sequence) Sequence[source]#

Remove None from a list

figure_fallback(source: str) Sequence[str][source]#

Fallback procedure that parses the paper with pure regex when TexSoup fails

Parameters:

source (str) – latex source to parse

Returns:

list of figure or figure* environments found in the source
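
A regex-only fallback of this kind could look like the sketch below. The pattern and function name are assumptions for illustration; nested figure environments are not handled.

```python
import re
from typing import List


def figure_fallback_sketch(source: str) -> List[str]:
    """Return every figure or figure* environment found in the source."""
    # \1 forces \end{...} to close the same environment name that was opened.
    pattern = re.compile(r"\\begin\{(figure\*?)\}.*?\\end\{\1\}", re.DOTALL)
    return [match.group(0) for match in pattern.finditer(source)]


found = figure_fallback_sketch(
    r"text \begin{figure}A\end{figure} more \begin{figure*}B\end{figure*}"
)
```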

find_graphics(where: str, image: str, folder: str = '', attempt_recover_extension: bool = True) str[source]#

Find graphics files for the figure if graphicspath provided

find_main_doc(folder: str) str | Sequence[str][source]#

Attempt to find which TeX file is the main document.

Parameters:

folder (str) – folder containing the document

Returns:

filename of the main document

fix_def_command(text: str) str[source]#

Works around a small bug in TexSoup with definitions written as \def\name{}

This function parses the text to add braces if needed: \def\name{} –> \def{\name}{}

alvinwan/TexSoup#131
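
Assuming the intended rewrite is from `\def\name{}` to `\def{\name}{}` (the backslashes are an assumption, since they do not appear in the rendered text), the brace insertion can be sketched with one substitution. The function name is hypothetical.

```python
import re


def fix_def_command_sketch(text: str) -> str:
    r"""Rewrite \def\name into \def{\name} so the macro name is braced."""
    # Only matches when \def is directly followed by a control sequence,
    # so definitions that are already braced are left unchanged.
    return re.sub(r"\\def\s*(\\[a-zA-Z]+)", r"\\def{\1}", text)
```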

force_macros_mathmode(text: str, macros: Sequence[str]) str[source]#

Make sure that detected macros are in math mode. They sometimes are not

force_mathmode(node)[source]#

Force all tex commands in the node to be in mathenv

it also checks if not already in mathmode to avoid issues.

get_arxivertag(source: str) list[source]#

Retrieve the arxiver tag if any

To specify which figures should appear alongside their paper, authors can leave a comment in any .tex file, as in this example:

%@arxiver{fig1, fig2, fig3}

see: https://arxiver.moonhats.com/

Return type:

list

Parameters:

source (str) – latex source

Returns:

list of tagged figures
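
A minimal sketch of how such a tag could be extracted, assuming the comment is written `%@arxiver{...}` with comma-separated figure names. The function name and the tolerance for whitespace after `%` are assumptions.

```python
import re
from typing import List


def get_arxivertag_sketch(source: str) -> List[str]:
    """Collect the figure names listed in %@arxiver{...} comments."""
    tagged: List[str] = []
    for content in re.findall(r"%\s*@arxiver\{([^}]*)\}", source):
        tagged.extend(name.strip() for name in content.split(",")
                      if name.strip())
    return tagged


tags = get_arxivertag_sketch("intro\n%@arxiver{fig1, fig2, fig3}\n")
```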

get_content(source: str, flexible: bool = True, verbose: bool = False) TexNode[source]#

Use TexSoup to parse the source, and try to recover if something goes wrong.

As we do not need the exact text throughout the paper, we can try to isolate potential error sections. The following attempts to remove the line that triggers an error.

get_content_per_section(source: str, flexible: bool = True, verbose: bool = True) Sequence[source]#

Find problematic portions of the document and attempt to skip them

get_macros_names(macros: Sequence[str]) Sequence[str][source]#

Return the list of macro names found in the newcommand definitions
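
For illustration, macro names can be pulled out of `\newcommand` definitions with a regular expression as below. The accepted syntax variants (`\renewcommand`, starred forms, optional braces) are assumptions, and the function name is hypothetical.

```python
import re
from typing import List, Sequence


def get_macros_names_sketch(macros: Sequence[str]) -> List[str]:
    r"""Extract the command name from each \newcommand definition."""
    pattern = re.compile(r"\\(?:re)?newcommand\*?\s*\{?\s*(\\[a-zA-Z@]+)")
    names = []
    for macro in macros:
        match = pattern.search(macro)
        if match:  # skip strings that are not newcommand definitions
            names.append(match.group(1))
    return names


names = get_macros_names_sketch([r"\newcommand{\msun}{M_\odot}",
                                 r"\newcommand\kms{km/s}"])
```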

inject_other_sources(maintex: str, texfiles: Sequence[str], verbose: bool = False)[source]#

Replace input and include commands with the content of the sub-files
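
The substitution can be sketched as follows. Unlike the documented function, which receives a list of file names, this illustration takes a hypothetical mapping from file name to content so that it stays self-contained; unknown files are left untouched.

```python
import re
from typing import Dict


def inject_sources_sketch(maintex: str, subfiles: Dict[str, str]) -> str:
    r"""Replace \input{name} and \include{name} by the sub-file content."""
    def _substitute(match: "re.Match") -> str:
        name = match.group(1).strip()
        # Try the name as written, then with a .tex extension.
        for key in (name, name + ".tex"):
            if key in subfiles:
                return subfiles[key]
        return match.group(0)  # leave unknown files untouched

    # The '{' right after the command name excludes \includegraphics.
    return re.sub(r"\\(?:input|include)\{([^}]*)\}", _substitute, maintex)


merged = inject_sources_sketch(r"A \input{intro} B \include{missing}",
                               {"intro.tex": r"\section{Intro}"})
```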

open_eps(filename, dpi=300.0)[source]#

replace(self, child, *nodes)[source]#

Replace provided node with node(s).

Parameters:
  • child (TexNode) – Child node to replace

  • nodes (TexNode) – List of nodes to substitute in

>>> from TexSoup import TexSoup
>>> soup = TexSoup(r'''
... \begin{itemize}
...     \item Hello
...     \item Bye
... \end{itemize}''')
>>> items = list(soup.find_all('item'))
>>> bye = items[1]
>>> soup.itemize.replace(soup.item, bye)
>>> soup.itemize
\begin{itemize}
    \item Bye
\item Bye
\end{itemize}

select_most_cited_figures(figures: Sequence[LatexFigure], content: dict, N: int = 3) Sequence[LatexFigure][source]#

Finds the number of references to each figure and select the N most cited ones

Parameters:
  • figures – list of all figures

  • content (dict) – paper content from TexSoup

  • N (int) – number of figures to select

Returns:

list of selected figures
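
The selection rule can be illustrated by counting references to each figure label. This sketch works on the raw LaTeX source string and on plain dictionaries with a `label` key, both assumptions made to keep it self-contained; the documented function takes the TexSoup content instead.

```python
import re
from collections import Counter
from typing import Dict, List, Sequence


def most_cited_figures_sketch(figures: Sequence[Dict[str, str]],
                              source: str, N: int = 3) -> List[Dict[str, str]]:
    """Rank figures by how often their label is referenced in the source."""
    # Count \ref{...} and \autoref{...} references per label.
    refs = Counter(re.findall(r"\\(?:auto)?ref\{([^}]*)\}", source))
    # sorted() is stable, so ties keep their original document order.
    ranked = sorted(figures, key=lambda fig: refs[fig["label"]], reverse=True)
    return ranked[:N]


figs = [{"label": "fig:a"}, {"label": "fig:b"}, {"label": "fig:c"}]
text = r"See \ref{fig:b} and \ref{fig:b}, also \ref{fig:a}."
top = most_cited_figures_sketch(figs, text, N=2)
```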

tex2md(latex: str) str[source]#

Replace some obvious TeX commands with their Markdown equivalents
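
A small sketch of this kind of conversion. Which commands the package actually translates is not stated here, so the three rules below (bold, emphasis, non-breaking space) are assumptions chosen for illustration.

```python
import re


def tex2md_sketch(latex: str) -> str:
    """Translate a few simple TeX commands to Markdown."""
    rules = [
        (r"\\textbf\{([^{}]*)\}", r"**\1**"),          # bold
        (r"\\(?:emph|textit)\{([^{}]*)\}", r"*\1*"),   # emphasis
        (r"~", " "),                                   # non-breaking space
    ]
    for pattern, replacement in rules:
        latex = re.sub(pattern, replacement, latex)
    return latex


converted = tex2md_sketch(r"\textbf{bold} and \emph{it}~here")
```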

arxiv_on_deck_2.latex_bib module#

class LatexBib(bibdata: BibliographyData)[source]#

Bases: object

A small interface to pybtex to handle bibliography entries

classmethod from_doc(doc: LatexDocument)[source]#

Create from a LatexDocument object

First checks whether there is a .bbl file with the document; if not, attempts to read the .bib file instead.

Parameters:

doc (LatexDocument) – the document to link with

Returns:

LatexBib object

TODO: extract bibitems entries from main doc if any

get_citation_md(key: str | Entry, kind: str = 'cite', max_authors: int = 3) str[source]#

Return formatted markdown string

Examples:

  • kind="citet" -> [author1, et al. (2023)](url)

  • kind="cite" -> [author1 and author2 2023](url)

Return type:

str

Parameters:
  • key – key to extract from

  • kind (str) – the kind of latex citation (expecting cite, citealt, citet, citep)

  • max_authors (int) – the number of authors to allow before the “et al.” abbreviation

Returns:

the formatted citation text

get_citation_text(key: str | Entry, kind: str = 'cite', max_authors: int = 3) str[source]#

Return formatted text

Examples:

  • kind="citet" -> author1, et al. (2023)

  • kind="cite" -> author1 and author2 2023

Return type:

str

Parameters:
  • key – key to extract from

  • kind (str) – the kind of latex citation (expecting cite, citealt, citet, citep)

  • max_authors (int) – the number of authors to allow before the “et al.” abbreviation

Returns:

the formatted citation text

get_short_authors(key: str | Entry, max_authors: int = 3) str[source]#

Returns the astronomy-style author list (e.g., bob et al.)

Return type:

str

Parameters:
  • key – key to extract from

  • max_authors (int) – the number of authors to allow before the “et al.” abbreviation

Returns:

short string of authors (e.g., “lada and lada”, “a, b and c”, “bob et al.”)
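
The three example outputs above suggest a rule along the following lines. This sketch operates on a plain list of family names rather than on a bibliography entry, which is an assumption made for illustration, as is the function name.

```python
from typing import Sequence


def short_authors_sketch(names: Sequence[str], max_authors: int = 3) -> str:
    """Format family names in a compact citation style."""
    if not names:
        return ""
    if len(names) > max_authors:
        return f"{names[0]} et al."
    if len(names) == 1:
        return names[0]
    # Two or more names: comma-separated, with "and" before the last one.
    return ", ".join(names[:-1]) + " and " + names[-1]
```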

get_url(key: str | Entry) str[source]#

Extract, if possible, the URL of the bib entry. It will attempt to use the url, adsurl, or doi entries, in that order.

Return type:

str

Parameters:

key – key to extract from

Returns:

string of the url (or empty string)
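
The fallback order can be sketched on a plain mapping of bibliography fields. Turning a bare DOI into a `https://doi.org/` link is an assumption of this example, not something the documentation states, and the function name is hypothetical.

```python
from typing import Mapping


def get_url_sketch(fields: Mapping[str, str]) -> str:
    """Return the first available link, trying url, adsurl, then doi."""
    for key in ("url", "adsurl", "doi"):
        value = fields.get(key, "").strip()
        if value:
            # Assumption: a bare DOI is turned into a resolver link.
            if key == "doi" and not value.startswith("http"):
                return "https://doi.org/" + value
            return value
    return ""  # no usable field found
```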

get_year(key: str | Entry) str[source]#

Return the reference’s publication year

Return type:

str

Parameters:

key – key to extract from

Returns:

the publication year (entry.fields['year'])

clean_special_characters(source: str) str[source]#

Replace latex macros of special characters (accents etc) for their unicode alternatives

Return type:

str

Parameters:

source (str) – bibitem raw string definition

Returns:

transformed bibitem string definition

merge_BibliographyData(dbs: Sequence[BibliographyData]) BibliographyData[source]#

Merge BibliographyData objects

Return type:

BibliographyData

Parameters:

dbs – Sequence of bibliographic data objects

Returns:

single bibliographic data with all entries from dbs

parse_bbl(fname: str) BibliographyData[source]#

Parse bibliographic information from bbl file (compiled bibliography)

Return type:

BibliographyData

Parameters:

fname (str) – filename to read the data from

Returns:

biblio data object

replace_citations(full_md: str, bibdata: LatexBib, kind='all', raise_exceptions: bool = False)[source]#

Parse and replace the \cite-family calls (cite, citet, citep, citealt) remaining in the Markdown text

Parameters:
  • full_md (str) – Markdown document

  • bibdata (LatexBib) – the bibliographic data

  • kind (str) – which of the citation macros to replace (all, citet, citep, citealt)

  • raise_exceptions (bool) – if set, raise an exception when a citation causes issues

Returns:

updated content

arxiv_on_deck_2.mpia module#

MPIA related functions.

This module handles the list of MPIA authors to monitor. Eventually, scientists should provide their publication names.

affiliation_verifications(content: str, word_list: Sequence[str] | None = None, verbose: bool = False) bool[source]#

Check if specific keywords are present to make sure at least one author is from MPIA. The test is case-insensitive, but all words must appear.

Return type:

bool

Parameters:
  • content (str) – text to check

  • word_list – list of words required for verification

  • verbose (bool) – print information

Returns:

True if all words are present
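
The check described above amounts to an all-words, case-insensitive containment test, sketched below. The default keyword list is hypothetical and chosen only for illustration, as is the function name.

```python
from typing import Optional, Sequence


def affiliation_check_sketch(content: str,
                             word_list: Optional[Sequence[str]] = None) -> bool:
    """Return True only if every word occurs in the content, ignoring case."""
    if word_list is None:
        # Hypothetical default keywords, chosen for illustration only.
        word_list = ("Heidelberg", "Max", "Planck")
    lowered = content.lower()
    return all(word.lower() in lowered for word in word_list)
```

A text mentioning "Max-Planck-Institut, HEIDELBERG" passes, while one that lacks any of the words fails.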

consider_variations(name: str) str[source]#

Consider a name with the usual character replacements

Return type:

str

Parameters:

name (str) – name

Returns:

name with replacements

family_name_from_initials(initials: str) str[source]#

Get family name only

Return type:

str

Parameters:

initials (str) – name with initials

Returns:

family name

filter_non_scientists(name: str) bool[source]#

Loose filter on expected authorships, removing IT, administration, and technical staff

Return type:

bool

Parameters:

name (str) – name

Returns:

False if name is not a scientist

get_initials(name: str) str[source]#

Get the short name, e.g., A.-B. FamName

Return type:

str

Parameters:

name (str) – full name

Returns:

initials
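
A rough sketch of how a full name could be reduced to the `A.-B. FamName` form. It assumes that the family name is the last whitespace-separated token and that hyphenated given names keep their hyphen; the function name is hypothetical.

```python
def get_initials_sketch(name: str) -> str:
    """Abbreviate given names, e.g. 'Anna-Beth FamName' -> 'A.-B. FamName'."""
    parts = name.split()
    if len(parts) < 2:
        return name  # nothing to abbreviate
    *given, family = parts
    short = []
    for word in given:
        # Hyphenated given names keep their hyphen: Anna-Beth -> A.-B.
        pieces = [piece for piece in word.split("-") if piece]
        short.append("-".join(piece[0].upper() + "." for piece in pieces))
    return " ".join(short + [family])
```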

get_mpia_mitarbeiter_list() Sequence[str][source]#

Get the main filtered list

Returns:

list of names (family name, full names, initials)

get_special_corrections(initials_name: str) str[source]#

Handle non-generic cases of initials

Return type:

str

Parameters:

initials_name (str) – name with initials

Returns:

name with corrected initials

parse_mpia_staff_list() Sequence[str][source]#

Parse the multi-page table from the MPIA website and return the name column

Returns:

list of names (full names)

strip_titles(name: str) str[source]#

Remove any title from the name that could interfere with author parsing

Returns:

cleaned name

arxiv_on_deck_2.version module#

Module contents#